
A Fast Suffix-Sorting Algorithm

R. Ahlswede, B. Balkenhol, C. Deppe, and M. Fröhlich

1 Introduction

We present an algorithm to sort all suffixes of x^n = (x_1, . . . , x_n) ∈ X^n lexicographically, where X = {0, . . . , q − 1}. Fast and efficient sorting of a large amount
of data according to its suffix structure (suffix-sorting) is a useful technology in
many fields of application, foremost in the field of data compression, where
it is used e.g. for the Burrows and Wheeler Transformation (BWT for short), a
block-sorting transformation ([3],[9]).
Larsson [4] describes the relationship between the BWT on the one hand and
suffix trees and context trees on the other hand. Sadakane [8] then suggests a
frequently referenced method to compute the BWT more time-efficiently, and the
algorithms based on suffix trees have been improved further ([6],[5],[1]).
In [3] it was observed that for an input string of size n this transformation can
be computed in O(n) time and space¹ using suffix trees. While suffix trees are
considered to be greedy in space – even small factors hidden in the O-notation
may decide on the feasibility of an algorithm – sorting was also accomplished by
alternative non-linear methods: Manber and Myers [7] introduced an algorithm
with O(n log n) worst-case time and 8n bytes of space, and in [2] an algorithm
based on Quicksort is suggested, which is fast on average but has worst-case
complexity O(n^2 log n). Most prominent in this respect are the Bentley–Sedgewick
algorithm, which requires 4n bytes, and Sadakane's combination of
the Manber–Myers algorithm with the Bentley–Sedgewick algorithm, with a
worst-case time of O(n log n) using 9n bytes [8].
The reduction of the space requirement due to an upper bound on n seems
trivial. However, it turns out that a considerable amount of engineering work
is needed to achieve an improvement while retaining an acceptable worst-case
time complexity. This paper proposes an algorithm, efficient in the terms de-
scribed above, ideal for handling large blocks of input data. We assume that the
cardinality q of the alphabet is smaller than the length n of the text string. Our algorithm
computes the suffix sorting in O(n) space and O(n^2 log n) time in the worst case.
It also has the property that it sorts the suffixes lexicographically according to
their prefixes of length t_2 = ⌊log_q ⌊n/2⌋⌋ in linear time in the worst case. After the ini-
tial sorting up to length t_2, we use a Quick-sort variant to sort the remaining part;
therefore we get the worst-case time O(n^2 log n). It is also possible to modify our
algorithm to use Heap-sort instead; then we get a worst-case time of O(n (log n)^2).
¹ This only holds if the space complexity of a counter or pointer is considered to be
constant (e.g. 4 bytes) and the basic operations on them (increment, comparison)
take constant time. This assumption is common in the literature and helpful for
practical purposes.


We use Quick-sort because it is better in practice: it has an average time of
O(n log n) like Heap-sort, but with a smaller constant factor.
The elements of X are called symbols. We denote the symbols by their rank
w.r.t. the order on X. We assume that $ = q − 1 ∈ X is a symbol not occurring
in the first n − 1 symbols of x^n, the sentinel symbol.
x_i is the i-th element of the sequence x^n. If i ≤ j, then (x_i, . . . , x_j) is the factor
of x^n beginning with the i-th element and ending with the j-th element. If i > j,
then (x_i, . . . , x_j) is the empty sequence. A factor v of x begins at position i and
ends at position j in x if (x_i, . . . , x_j) = v. To conveniently refer to the factors of
a sequence, we use the abbreviation x_i^j for (x_i, . . . , x_j).

2 The Initial Sorting Step


Before we tackle the problem of sorting all suffixes of a given sequence in lexi-
cographical order, we start by considering the case where we only sort the suffixes
correctly with respect to prefixes of a fixed length. The simplest case is to look at
all prefixes of length one, which amounts to sorting all symbols occurring in the
input sequence lexicographically.

2.1 Sorting of the Symbols


The sorting of the symbols of a given input sequence x^n with symbols from a
finite alphabet X can be done in linear time and space as follows:
We define q counters (counter_0[0], . . . , counter_0[q − 1]) and count for each sym-
bol in {0, . . . , q − 1} how often it occurs in x^n. In each step i we have to increase
exactly one counter (counter_0[x_i]) by one. Therefore, getting the frequencies of the
symbols requires O(n) operations. Now our alphabet is given in lexicographic or-
der and we generate the output in the following way: first output counter_0[0]
many zeros, followed by counter_0[1] many ones, and so on. Obviously the generated out-
put sequence is produced in O(n) operations and the sorting is done.
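For illustration, this counting sort can be written in a few lines of C (a minimal sketch; sort_symbols is a hypothetical name, and the paper itself gives no C code for this step):

#include <stdlib.h>
#include <string.h>

/* Counting sort of the symbols of x[0..n-1] over the alphabet {0,...,q-1};
   a sketch of the O(n + q) procedure described above. */
void sort_symbols(const unsigned char *x, size_t n, unsigned q,
                  unsigned char *out)
{
    size_t *counter0 = calloc(q, sizeof *counter0);
    if (counter0 == NULL)
        return;                               /* allocation failed */

    for (size_t i = 0; i < n; i++)            /* count every symbol once */
        counter0[x[i]]++;

    size_t pos = 0;                           /* emit all 0s, then all 1s, ... */
    for (unsigned a = 0; a < q; a++) {
        memset(out + pos, (int)a, counter0[a]);
        pos += counter0[a];
    }
    free(counter0);
}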

2.2 Sorting a Given Prefix Length


We would like to continue the sorting of all suffixes in an iterative way, using
the counting idea of the previous section. In a later step of the algorithm we
need n counters. We have to allocate this memory at the beginning anyway, which
allows us to use it already in the initial sorting phase. We choose t_1 such that
2^{t_1−1} < q ≤ 2^{t_1} and t_2 such that 2^{t_1 t_2} ≤ ⌊n/2⌋ < 2^{t_1(t_2+1)}. For simplicity we
assume from now on that q = 2^{t_1} and n = 2^{t_1 t_2 + 1}.
We would like to sort all suffixes such that the first t_2 symbols of each suffix are
in the lexicographically correct order.
Now we will count the number of occurrences of factors of length t_2 in our
sequence x^n. We assume that x_{n+1} = · · · = x_{n+t_2−1} = q − 1 and count the factors
as follows. The counter counter[a_1 q^{t_2−1} + a_2 q^{t_2−2} + · · · + a_{t_2} q^0] counts the number of
occurrences of the factor (a_1, . . . , a_{t_2}). Let us define a temporary value tmp =
2^{t_1 t_2} − 1 and i = n. This is position n of the sequence, with the factor
(q − 1, . . . , q − 1). Now, starting at the end of the (extended) input
sequence x^{n+t_2−1} and moving down to the beginning, in each step we increase
counter[tmp] by one, decrease i by one and calculate

    tmp → ⌊tmp / 2^{t_1}⌋ + x_i · 2^{t_1(t_2−1)}.
Notice that multiplications and divisions by powers of two can be represented
by shifts. Let us denote

    a >> b = ⌊a / 2^b⌋   and   a << b = a · 2^b.

Furthermore notice that the + operation can be replaced by a binary logical
or-operation, which we denote by |. Hence in total we need O(n) operations.
By construction tmp only takes values less than n/2 = (n >> 1), so that
we can calculate the partial sums of the entries counter[j] and store them
in the second half of the memory for the array counter:

    counter[n/2 + j] → Σ_{i=0}^{j−1} counter[i],   j = 0, . . . , n/2 − 1.

Obviously this calculation can also be done linearly in time:

i->1
counter[(n>>1)]->0
while i< (n>>1) do
    counter[(n>>1)+i] -> counter[(n>>1)+i-1] + counter[i-1]
    i-> i+1
done
Finally we have to write back the result of the sorting. In order to continue we
introduce two further arrays of size n: one, which we denote as pointer, describing
the starting points of the suffixes, and a second one, denoted as index, storing
the partial results of the sorting.
Again we start with tmp = 2^{t_1 t_2} − 1 and at position i = n.
while i>n-t_2 do
    i->i-1
    tmp->(tmp>>t_1)|(x_i<<(t_1*(t_2-1)))
    counter[tmp]->counter[tmp]-1
    index[i]->counter[tmp+(n>>1)]+counter[tmp]
    pointer[index[i]]->i
done
while i>0 do
    i->i-1
    tmp->(tmp>>t_1)|(x_i<<(t_1*(t_2-1)))
    counter[tmp]->counter[tmp]-1
    index[i]->counter[tmp+(n>>1)]
    pointer[index[i]+counter[tmp]]->i
done

In the first loop we consider the cases where we have to take the sentinel into
account (we assume that x_{n+i} = $). With the starting definition of tmp the
sentinel is treated as a number greater than or equal to |X| − 1. Using the fact
that it occurs only at the end of the sequence, that is, in the bucket with the largest
value of tmp, we can fix the positions of the last t_2 entries exactly: the starting point
of the group of suffixes with prefix tmp, represented as the integer tmp at that moment,
plus the number of remaining occurrences of that value tmp. In all other cases (second loop)
we set index[i] to the starting position of the interval of suffixes with prefix tmp.
In other words, after these loops pointer[1], . . . , pointer[n] represent the start-
ing positions of the suffixes in lexicographical order according to the prefix-
es of length t_2. If index[pointer[i]] < index[pointer[j]] (index[pointer[i]] >
index[pointer[j]]), then the suffix starting at pointer[i] is lexicographically small-
er (larger) than the suffix starting at position pointer[j]. If the two values are
equal, then the two suffixes have a common prefix of length at least t_2.
Notice that to finish the lexicographic order we can continue using the
two arrays pointer and index only; there is no need to look at the original
input sequence again to calculate the defined total order, so that the continuation
is independent of the alphabet size.
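Putting the pieces of this section together, the whole initial sorting step can be sketched in C as follows. This is a minimal, 0-based sketch under the stated simplifying assumptions (q = 2^{t_1}, n = 2^{t_1 t_2 + 1}, the sentinel q − 1 only at the last position), not the authors' implementation; initial_sort is a hypothetical name. The last t_2 suffixes are bucketed like all others: under the sentinel assumption their buckets contain a single suffix each, so they are already placed exactly, which is what the first write-back loop above achieves.

#include <stddef.h>

/* counter must have at least n entries, with the first n/2 zero-initialised */
static void initial_sort(const unsigned char *x, size_t n,
                         unsigned t1, unsigned t2,
                         size_t *counter,   /* work array of size n          */
                         size_t *pointer,   /* out: rank -> start position   */
                         size_t *index)     /* out: position -> bucket rank  */
{
    const size_t half = n >> 1;             /* = 2^(t_1*t_2) code values     */
    size_t tmp = half - 1;                  /* code of the factor at n-1     */

    /* 1. count every factor of length t_2 (virtual q-1 symbols at the end) */
    for (size_t j = n; j-- > 0; ) {
        counter[tmp]++;
        if (j > 0)
            tmp = (tmp >> t1) | ((size_t)x[j - 1] << (t1 * (t2 - 1)));
    }

    /* 2. partial sums, stored in the second half of counter */
    counter[half] = 0;
    for (size_t i = 1; i < half; i++)
        counter[half + i] = counter[half + i - 1] + counter[i - 1];

    /* 3. write back: pointer sorted by the t_2-prefix, index = bucket rank */
    tmp = half - 1;
    for (size_t j = n; j-- > 0; ) {
        counter[tmp]--;                              /* occurrences left    */
        index[j] = counter[half + tmp];              /* bucket start rank   */
        pointer[counter[half + tmp] + counter[tmp]] = j;
        if (j > 0)
            tmp = (tmp >> t1) | ((size_t)x[j - 1] << (t1 * (t2 - 1)));
    }
}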

3 Only Three Elements

In order to continue the sorting we first analyze how to sort and how to calculate
the median of three given numbers.

3.1 Median-Position-Search of Three Elements

The median m of a triple (n_1, n_2, n_3) ∈ ℕ_0^3 is a value equal to at least one of
them which lies between the two others, i.e.

    m = n_1 ⇒ n_2 ≤ n_1 ≤ n_3 or n_3 ≤ n_1 ≤ n_2,
    m = n_2 ⇒ n_1 ≤ n_2 ≤ n_3 or n_3 ≤ n_2 ≤ n_1,
    m = n_3 ⇒ n_2 ≤ n_3 ≤ n_1 or n_1 ≤ n_3 ≤ n_2.
Notice that we are not interested in the values themselves, only in their relative
order, i.e. for us there is no difference between the cases (1, 1, 1) and
(2, 2, 2). Therefore we partition the set of triples as follows. For k ∈ ℕ_0 and
l, m ∈ ℕ we define 13 subsets A_1, . . . , A_13 ⊂ ℕ_0^3:

    A_1 = {(k, k, k)}              A_8  = {(k, k + l, k + l + m)}
    A_2 = {(k, k, k + l)}          A_9  = {(k, k + l + m, k + l)}
    A_3 = {(k, k + l, k)}          A_10 = {(k + l, k, k + l + m)}
    A_4 = {(k + l, k, k)}          A_11 = {(k + l, k + l + m, k)}
    A_5 = {(k, k + l, k + l)}      A_12 = {(k + l + m, k, k + l)}
    A_6 = {(k + l, k, k + l)}      A_13 = {(k + l + m, k + l, k)}
    A_7 = {(k + l, k + l, k)}
For a given triple (n_1, n_2, n_3) the median is known to us if we know the
index i with (n_1, n_2, n_3) ∈ A_i. Therefore we define the following questionnaire of
yes–no questions, where a question has one of the forms a ≤ b, a < b, a = b.

if n_1 <= n_2 then
    if n_2 <= n_3 then m=n_2
    else
        if n_1 <= n_3 then m=n_3
        else m=n_1
        endif
    endif
else
    if n_3 <= n_2 then m=n_2
    else
        if n_1 <= n_3 then m=n_1
        else m=n_3
        endif
    endif
endif

Notice that we need at most three yes–no questions, and only two in the
case where the median is already in the middle.
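The same decision tree can be written as a small C function; a sketch (median3 is a hypothetical name):

#include <stddef.h>

/* Median of three values with at most three comparisons, mirroring the
   decision tree above. */
static size_t median3(size_t n1, size_t n2, size_t n3)
{
    if (n1 <= n2) {
        if (n2 <= n3) return n2;          /* n1 <= n2 <= n3 */
        return (n1 <= n3) ? n3 : n1;      /* n3 < n2        */
    }
    if (n3 <= n2) return n2;              /* n3 <= n2 < n1  */
    return (n1 <= n3) ? n1 : n3;          /* n2 < n3        */
}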

3.2 Sorting of Three Elements

Using questions of the form mentioned in the previous section we can sort three
elements using at most four questions:

if n_1 <= n_2 then
    if n_2 <= n_3 then
        if n_1 = n_2 then
            if n_2 = n_3 then (n_1,n_2,n_3) in A_1
            else (n_1,n_2,n_3) in A_2
            endif
        else
            if n_2 = n_3 then (n_1,n_2,n_3) in A_5
            else (n_1,n_2,n_3) in A_8
            endif
        endif
    else
        if n_1 <= n_3 then
            if n_1 = n_3 then (n_1,n_2,n_3) in A_3
            else (n_1,n_2,n_3) in A_9
            endif
        else
            if n_1 = n_2 then (n_1,n_2,n_3) in A_7
            else (n_1,n_2,n_3) in A_11
            endif
        endif
    endif
else
    if n_1 > n_3 then
        if n_2 = n_3 then (n_1,n_2,n_3) in A_4
        else
            if n_2 < n_3 then (n_1,n_2,n_3) in A_12
            else (n_1,n_2,n_3) in A_13
            endif
        endif
    else
        if n_1 = n_3 then (n_1,n_2,n_3) in A_6
        else (n_1,n_2,n_3) in A_10
        endif
    endif
endif
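The decision tree above translates directly into C; a sketch returning the class number i with (n_1, n_2, n_3) ∈ A_i (classify3 is a hypothetical name; at most four comparisons are made on every path):

#include <stddef.h>

static int classify3(size_t n1, size_t n2, size_t n3)
{
    if (n1 <= n2) {
        if (n2 <= n3) {
            if (n1 == n2) return (n2 == n3) ? 1 : 2;
            return (n2 == n3) ? 5 : 8;
        }
        if (n1 <= n3) return (n1 == n3) ? 3 : 9;
        return (n1 == n2) ? 7 : 11;
    }
    if (n1 > n3) {
        if (n2 == n3) return 4;
        return (n2 < n3) ? 12 : 13;
    }
    return (n1 == n3) ? 6 : 10;
}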

4 The Main Loop of the Sorting Algorithm


After the initial sorting phase we have the array pointer, which points to the
starting positions of the suffixes, lexicographically correctly sorted according to
the prefixes of length t_2. index contains the partial ordering: if two values are
different, then the larger one belongs to the lexicographically larger suffix;
if they are equal, then the two suffixes have a common prefix of length at least
t_2. Finally, with the second half of the array counter we can determine
the positions of the intervals with common prefixes of length t_2. We now use
counter[0] to count the number of intervals where we have to continue with the
sorting; more precisely, counter[0] points to the first free place in memory where
we can store a further interval, which is 1 in the beginning (counter[1] is free).

counter[0]->1

counter[0] is initialized with 1 because we need it this way later and we are
working with "unsigned int". We now start the loop that collects the intervals
which are not necessarily correctly sorted yet.
i->0
while i< 2^(t_1*t_2) do
    if counter[(n>>1)+i+1]-counter[(n>>1)+i]>1 then
        counter[counter[0]]->counter[(n>>1)+i]
        counter[counter[0]+1]->counter[(n>>1)+i+1]-1
        while index[pointer[counter[counter[0]+1]]]>
              index[pointer[counter[counter[0]]]] do
            counter[counter[0]+1]->counter[counter[0]+1]-1
        done
        if counter[counter[0]+1]!=counter[counter[0]] then
            counter[0]->counter[0]+2
        endif
    endif
    i->i+1
done

Notice that during the loop we reuse the memory in counter from (n >> 1)
to n.
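A C sketch of this collection step, using the storage convention counter[counter[0]], counter[counter[0]+1] from above (collect_intervals is a hypothetical name; the refinement of the inner while loop, which drops suffixes at the end of a bucket that are already distinguished, is omitted, and the last bucket is skipped because under the sentinel assumption it contains exactly one suffix):

#include <stddef.h>

static void collect_intervals(size_t *counter, size_t n)
{
    const size_t half = n >> 1;        /* number of t_2-prefix buckets      */
    counter[0] = 1;                    /* first free slot in the work area  */
    for (size_t i = 0; i + 1 < half; i++) {
        if (counter[half + i + 1] - counter[half + i] > 1) {
            counter[counter[0]]     = counter[half + i];         /* begin */
            counter[counter[0] + 1] = counter[half + i + 1] - 1; /* end   */
            counter[0] += 2;
        }
    }
}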

4.1 Split an Interval


We have to sort an interval from position begin to end, that is, pointer[begin] to
pointer[end] has to be sorted, where all these suffixes already share a common prefix
of length length. We do the sorting by a three-way quick-sort split. The part 'smaller'
contains all pointers which are smaller than the first entry (smaller as defined by index!),
the part 'equal' the pointers which agree with the first one in the first 2·length positions,
and the part 'bigger' the remaining ones. After we have split this interval
we have to continue with 'smaller' and 'bigger' at length length and with 'equal'
at length 2 · length. These intervals (starting point, end point) we return to the
calling function using two arrays x and y.
The procedure is given a value val, the index of the interval stored in counter at positions
counter[val − 1] and counter[val], the value length, which is the length of the
common prefix already known from the previous steps (after the initial sorting
it is t_2), and a flag which describes whether the intervals are stored at the
beginning of counter or at the end (after the initial part, at the beginning).
Now the beginning of the interval is given by begin = counter[val − 1] and
the end position by end = counter[val]. Notice that the last length pointers of
the original sequence cannot occur inside this interval, because they were correct-
ly inserted in one of the previous steps due to the (virtual) sentinel symbol at
the end of the input sequence. Therefore, if we look at the suffixes starting at
pointer[begin] and pointer[end], then we know that by construction they have a com-
mon prefix of length at least length. But if we look at the two suffixes without
this prefix of length length, then these two suffixes have also been sorted correctly
according to their prefixes of length length. In other words, the result of
comparing the two pointers pointer[begin] and pointer[end] is equal to the
result of comparing pointer[begin]+length and pointer[end]+length. We
can get this result by using the values stored in the array index. Let us write
a ≺ b for two pointers a, b if a is lexicographically smaller than b, where a
pointer is smaller than another one if the corresponding suffix starting at that
pointer is lexicographically smaller than the other suffix. Then

    pointer[begin] ≺ pointer[end]  ⇔  pointer[begin] + length ≺ pointer[end] + length.

Therefore, if now index[pointer[begin] + length] = index[pointer[end] + length],
then the suffixes starting at pointer[begin] and pointer[end] have a common
prefix of length at least 2 · length. Otherwise we can use this result to get the
correct comparison result. Notice that in this way we double the length of the
comparison in each step.
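The comparison rule just derived can be captured in a few lines of C (a sketch with the arrays of Section 2.2; cmp_known_prefix is a hypothetical name, and both positions plus length are assumed to stay inside the index array, as argued above):

#include <stddef.h>

/* Compare two suffixes a and b that are already known to agree on their
   first `length` symbols, using the partial ranks in index[]; a return
   value of 0 means a common prefix of at least 2*length symbols. */
static int cmp_known_prefix(const size_t *index, size_t a, size_t b,
                            size_t length)
{
    if (index[a + length] == index[b + length])
        return 0;
    return index[a + length] < index[b + length] ? -1 : 1;
}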
Now for a given interval we would like to split it into several parts, similar to
quick-sort. Therefore we take three values and calculate their median as described
in Section 3.1:

n_1->index[pointer[begin]+length]
n_2->index[pointer[(begin+end)>>1]+length]
n_3->index[pointer[end]+length]

median->(n_1 <= n_2 ?
          (n_2 <= n_3 ? n_2 : (n_1 <= n_3 ? n_3 : n_1 ))
        : (n_3 <= n_2 ? n_2 : (n_1 <= n_3 ? n_1 : n_3 )))
With currentindex = index[pointer[begin]] we have the value of index[pointer[i]]
for all begin ≤ i ≤ end. Now we would like to split the interval into three parts: one for
the pointers which are smaller than the median, one for those which are equal, and
one for those which are larger. We divide the parts by exchanging the values of the
pointers as follows:
First we need two further variables which we set to begin and end, respectively.

s->begin
b->end

And we need yet another variable k for the actual position inside the interval.
As long as index[pointer[k] + length] < median and k ≤ b, the current end
of the interval, we increase k by one:

k->begin /* the starting point */
while index[pointer[k]+length]<median && k<=b do
    k->k+1
done
s->k

We set s to the actual value of k, such that s points to the first position which
is greater than or equal to the median. In a similar way we decrease b from the end,
so that b points to the first position from the end which is less than or equal to the median:

while index[pointer[b]+length]>median && k<=b do
    b->b-1
done

Remember that we have stopped the first loop in a situation where

    index[pointer[k] + length] ≥ median

and the second one where

    index[pointer[b] + length] ≤ median.

Now let us continue in the following way:

if index[pointer[k]+length]>median then
    SWAPPOINTER(k,b)
    b->b-1

where SWAPPOINTER(k,b) denotes the following operations:

    tmp->pointer[k]; pointer[k]->pointer[b]; pointer[b]->tmp

such that the two values are simply exchanged. Now we have
index[pointer[k] + length] ≤ median and we continue:

    if index[pointer[k]+length]=median then
        k->k+1
        while index[pointer[k]+length]=median do
            k->k+1
        done
    else
        k->k+1
        s->s+1
    endif
else
    k->k+1
    while index[pointer[k]+length]=median && k<=b do
        k->k+1
    done
endif
Now if s > begin then the part from begin to s − 1 stores the pointers which are
smaller than the median, and if b < end then the part from b + 1 to end holds the
pointers which are larger than the median. Furthermore, if s < k then the part
from s to k − 1 holds the pointers which are equal to the median. Let us first continue
with the case s = k:

if s=k then
    s->end+1 /* we make the value impossible, in other */
             /* words larger than end */
    while k<=b && s>end do
        if index[pointer[k]+length]<median then
            k->k+1 /* one further pointer which is smaller */
        else
            if index[pointer[k]+length]>median then
                SWAPPOINTER(k,b)
                b->b-1 /* add to bigger interval */
            else
                s->k /* s is getting a value <= end and */
                     /* the loop stops. */
                k->k+1 /* they are equal */
            endif
        endif
    done
endif

Now we have found at least one pointer which is equal to the median. We continue
similarly as before, but if index[pointer[k] + length] < median then
we additionally have to exchange the pointers at positions k and s and also
increase s. Furthermore, the only stop condition for the loop is now k > b.

while k<=b do
    if index[pointer[k]+length]<median then
        SWAPPOINTER(k,s)
        k->k+1
        s->s+1
    else
        if index[pointer[k]+length]>median then
            SWAPPOINTER(k,b)
            b->b-1 /* add to bigger */
        else
            k->k+1 /* they are equal */
        endif
    endif
done
Now we have the three parts:

    begin, . . . , s − 1    the pointers which are smaller,
    s, . . . , b            the pointers which are equal,
    b + 1, . . . , end      the pointers which are larger.

If s − 1 < begin or b + 1 > end then the corresponding interval is empty. In
order to use these parts in the future, we have to update the values of index for
the current pointers. Notice that 'equal to the median' means that these suffixes
have a common prefix of length at least 2 · length.
For the first interval (if it exists) nothing has to be done, because the values
of index already point to the starting position of the interval. The new starting point
of the second part is

currentindex->currentindex+s-begin

Of course the second part contains at least one pointer by construction (the
pointer which was used to calculate the median has a common prefix with itself!).
if s>begin && s<=b then
    k->s
    while k<=b do
        index[pointer[k]]->currentindex
        k->k+1
    done
endif

Finally we have to calculate the starting point of the last interval (if it exists):

currentindex->currentindex+b+1-s

if b+1<=end then
    k->b+1
    while k<=end do
        index[pointer[k]]->currentindex
        k->k+1
    done
endif
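As an illustration, here is a simplified C sketch of this splitting step: a three-way partition of pointer[begin..end] around the median of three keys index[pointer[.] + length], followed by the index update just described. It uses a plain Dutch-national-flag scan instead of the exact pointer manipulation of the text; split_interval is a hypothetical name, and pointer[i] + length is assumed to stay inside index, as argued at the beginning of this section.

#include <stddef.h>

static void split_interval(size_t *pointer, size_t *index,
                           size_t begin, size_t end, size_t length)
{
    size_t n1 = index[pointer[begin] + length];
    size_t n2 = index[pointer[(begin + end) >> 1] + length];
    size_t n3 = index[pointer[end] + length];
    size_t median = n1 <= n2
        ? (n2 <= n3 ? n2 : (n1 <= n3 ? n3 : n1))
        : (n3 <= n2 ? n2 : (n1 <= n3 ? n1 : n3));

    /* invariant: [begin,s) < median, [s,k) = median, (b,end] > median */
    size_t s = begin, k = begin, b = end;
    while (k <= b) {
        size_t key = index[pointer[k] + length];
        if (key < median) {
            size_t t = pointer[k]; pointer[k] = pointer[s]; pointer[s] = t;
            s++; k++;
        } else if (key > median) {
            size_t t = pointer[k]; pointer[k] = pointer[b]; pointer[b] = t;
            b--;   /* cannot drop below begin: some key equals the median */
        } else {
            k++;
        }
    }

    /* the smaller part keeps its old starting rank; the equal part starts
       s - begin later, the bigger part another b + 1 - s after that */
    size_t currentindex = index[pointer[begin]];
    if (s > begin)
        for (size_t i = s; i <= b; i++)
            index[pointer[i]] = currentindex + (s - begin);
    for (size_t i = b + 1; i <= end; i++)
        index[pointer[i]] = currentindex + (s - begin) + (b + 1 - s);
}

The caller then continues with (begin, s − 1) and (b + 1, end) at the current length and with (s, b) at length 2 · length, which is exactly what the insertion with INSTOCOUNTER below arranges.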

Now we have to continue our sorting algorithm on the constructed intervals.
But before we start to consider the interval from s to b at length 2 · length, we
would like to finish all intervals at length length, in order to double the compared
prefix lengths again. For that reason we store that interval at the opposite end
of the array counter from the one on which we are working at the moment. After the initial
part we are working at the beginning to store our intervals, so we store
the interval from s to b at the end. After we have finished all intervals which we
have to compare at length length, we start to work at the end, with the intervals
correctly sorted up to length 2 · length, and store all intervals we produce for length
4 · length at the beginning. Notice that the total number of intervals we have to
store is always less than n, so that if we need more space at one end it is free at
the other end of the array counter, and vice versa. To add these intervals we define
a function INSTOCOUNTER(FROM, TO, FLAG), where FROM and TO are the
boundaries of the interval which we have to add and FLAG describes where to add it. If we
are working at the end of counter we use counter[n] similarly to counter[0] for the
beginning part. To delete one interval at the end we have to increase counter[n],
so we need two different rules to add an interval at the end:

INSTOCOUNTER(FROM,TO,FLAG) {
    switch(FLAG) {
    case 0: {
        counter[counter[0]]->(FROM);
        counter[0]->counter[0]+1
        counter[counter[0]]->(TO);
        counter[0]->counter[0]+1
        break;
    }
    case 1: {
        counter[n]->counter[n]-1
        counter[counter[n]]->(TO);
        counter[n]->counter[n]-1
        counter[counter[n]]->(FROM);
        break;
    }
    default: { /* case 2 */
        counter[counter[n]]->(FROM);
        counter[counter[n]+1]->(TO);
        counter[n]->counter[n]-2;
        break;
    }
    }
}

Now the insertion of the intervals using the function INSTOCOUNTER can
be done as follows:

if s-begin>1 then
    INSTOCOUNTER(begin,s-1,2-(flag<<1))
endif
if b-s>0 then
    INSTOCOUNTER(s,b,flag)
endif
if end-b>1 then
    INSTOCOUNTER(b+1,end,2-(flag<<1))
endif

4.2 Calling the Sorting Procedure


To conclude the description of the whole algorithm it remains to describe the
step between the initial sorting phase and the calls of the procedure that splits a
given interval.
We start in a situation where we are given the three arrays counter,
pointer and index, and we know that if we use the values stored in index as
the rule for comparing two pointers, then the result is correct according to
the first t_2 symbols (from the initial sorting part).
As mentioned earlier, we would like to use the array counter from both sides. At the
beginning we use a variable length which describes the length of the common
prefix that is correctly sorted. This variable is initialized with t_2 from the initial sorting
phase. In order to double the length in each loop we have to use the information
stored in index to sort all suffixes correctly according to the first 2 · length
symbols. After that we use the information to double the length again, and so
on. counter[0] is already used to describe the first free position in memory at
the beginning of counter. Analogously we use counter[n] in order to do the same
at the end. We therefore have to store at the same time intervals sorted
with prefixes of length length and of length 2 · length. If there is no further interval
of length length, we start to sort those of length 2 · length and produce new ones
of length 4 · length. Notice that the total number of intervals cannot be more
than n >> 1, so that to store them with starting and ending point we need at
most n values of memory. Furthermore, from the initial sorting part some
pointers at the end of the input sequence (exactly t_2 many) are already correctly
sorted, so the memory requirement is strictly less than n − 2 (we do an
initial sorting of length at least 2). For typical files we need only something
like n >> 2 entries of memory, but in the worst case n − t_2 entries are needed, as we can see
from the following example:
Take a de Bruijn sequence of length 2^{n−1}, copy the sequence and concatenate the
two copies. The property of a de Bruijn sequence is that, if we look at a linear
shift-register of length n − 1, these sequences have maximal period, or more
precisely, every binary sequence of length n − 1 occurs exactly once. Now if we
have a length of t_2 = n − 1, then each prefix occurs in the constructed sequence
exactly twice, and hence we have n intervals of which only n − 1 are
correctly sorted after the initial phase.
Now at the beginning we have no interval of length 2 · length to sort:
counter[n]->n;
We start the main loop.

/* as long as there is something to compare */
while(counter[0]>1) do
/* starting with the beginning part (at the end) */

We run this inner processing twice in alternation, because first we would like to sort
every interval of length length correctly; after that we continue at the end of counter
and sort the intervals of length 2 · length. If there are further intervals, of length
4 · length, then we can find them at the beginning of counter.

/* as long as we have something to compare of length "length" */

while counter[0]>1 do
    counter[0]->counter[0]-2

    switch(counter[counter[0]+1]-counter[counter[0]]) {
    /* the switch value + 1 is the number of elements ! */

Notice that with the procedure of Section 4.1 the calculation of the median is
only efficient if we have enough elements to sort. Therefore, in the cases where we
have intervals of small length, we sort directly.
With only two entries we need at most two questions in order to sort them:

    case 1: { /* only two entries */
        m1->index[pointer[counter[counter[0]]]+length]
        /* shortcuts to store the values in order */
        m2->index[pointer[counter[counter[0]+1]]+length]
        /* not to calculate them twice */
        if m1=m2 then

The two values are equal, which means the two suffixes agree on their first
2·length symbols, and therefore we add the interval at the end of counter.

            INSTOCOUNTER(counter[counter[0]],
                         counter[counter[0]+1],1)
        else

They are different, so we can compare them:

            if m1<m2 then

The value at the beginning of the interval is smaller than the one at the end, therefore
we do not have to exchange the order and we only update the index:

                index[pointer[counter[counter[0]+1]]]->
                    index[pointer[counter[counter[0]+1]]]+1
            else

We have to swap them and then update the index of the suffix that now stands at
the end (the originally first, lexicographically larger one):

                SWAPPOINTER(counter[counter[0]],
                            counter[counter[0]+1])
                index[pointer[counter[counter[0]+1]]]->
                    index[pointer[counter[counter[0]+1]]]+1
            endif
        endif
        break;
    } /* end of case interval of length 2 */
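Stripped of the surrounding bookkeeping, this two-element case amounts to the following C sketch (sort_pair is a hypothetical name; when it returns 0 the caller adds the pair to the opposite end of counter, as done with INSTOCOUNTER above):

#include <stddef.h>

/* begin and end = begin + 1 are the two positions of the interval in
   pointer.  Returns 0 if the pair is still tied for 2*length symbols,
   1 if it is now fully sorted. */
static int sort_pair(size_t *pointer, size_t *index,
                     size_t begin, size_t end, size_t length)
{
    size_t m1 = index[pointer[begin] + length];
    size_t m2 = index[pointer[end] + length];

    if (m1 == m2)
        return 0;                        /* common prefix >= 2*length      */
    if (m1 > m2) {                       /* wrong order: swap the pointers */
        size_t t = pointer[begin];
        pointer[begin] = pointer[end];
        pointer[end] = t;
    }
    index[pointer[end]] += 1;            /* the larger suffix gets rank+1  */
    return 1;
}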

An interval with three elements can be sorted as described in Section 3.2. We
call a function sort3 which needs as parameters the array counter, the position
(counter[counter[0]]) in counter giving the boundaries of the interval to sort, the
arrays pointer and index, a flag which describes how to insert a new interval to
continue with, the length of the already compared prefixes, and finally the length
n (needed to insert a new interval using the function INSTOCOUNTER).

    case 2: { /* interval of length 3 */
        sort3(counter,counter[counter[0]],pointer,
              index,1,length,n)

Either everything is sorted, or we get back an interval of suffixes which agree on
their first 2 · length symbols; in that case sort3 adds it to the end of counter.

        break;
    } /* end of interval of length 3 */

In all other cases we call the function described in Section 4.1 which we denote
as splitcount.

    default: { /* the general case */
        splitcount(counter,counter[0]+1,pointer,index,
                   1,length,n);
        break;
    } /* end of the general case */
    } /* end of the switch */

Now we can stop the loop for sorting intervals at length length and look at
the intervals of length 2 · length.

done /* inner loop: counter[0]>1 */

length->(length<<1)

The length "length" is finished, that is, we can continue with "2*length". In order
not to copy the end to the beginning before continuing the main loop, we repeat the whole
procedure with the roles of the beginning and the end of the array counter
exchanged. Of course counter[0] = 1, in other words at the beginning there
is no interval of length 4 · length which we have to compare. Now we have to start the
loop at the end:

while counter[n]<n do
    switch(counter[counter[n]+1]-counter[counter[n]]) {
    /* the switch value + 1 is the number of elements ! */
    case 1: { /* only two elements */
        /* two shortcuts */
        m1->index[pointer[counter[counter[n]]]+length]
        m2->index[pointer[counter[counter[n]+1]]+length]
        if m1=m2 then

The two values are equal and we have to add a new interval at the beginning of
the array counter, using the function INSTOCOUNTER.

            INSTOCOUNTER(counter[counter[n]],
                         counter[counter[n]+1],0)
        else /* we can compare them */
            if m1<m2 then
                index[pointer[counter[counter[n]+1]]]->
                    index[pointer[counter[counter[n]+1]]]+1
            else
                SWAPPOINTER(counter[counter[n]],
                            counter[counter[n]+1])
                index[pointer[counter[counter[n]+1]]]->
                    index[pointer[counter[counter[n]+1]]]+1
            endif
        endif
        break;
    } /* end of the case with only two elements. */

As before, we also consider a separate case with only three elements using the
function sort3, but now with flag = 0.

    case 2: {
        sort3(counter,counter[counter[n]],pointer,
              index,0,length,n)
        break;
    } /* end of case 2. */

Again, in all other cases we use the function splitcount.

    default: {
        splitcount(counter,counter[n]+1,pointer,
                   index,0,length,n);
        break;
    }
    } /* end of the switch */
    /* continue with the next interval. */
    counter[n]->counter[n]+2
done /* end of the loop counter[n]<n */
length->(length<<1) /* double again and return to the */
                    /* first loop: counter[0]>1. */
done
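To make the overall control flow concrete, here is a strongly simplified C sketch of the doubling procedure of this section. It keeps the still-tied intervals in two explicit worklists instead of the two ends of counter, and it sorts each interval by the key index[pointer[.] + length] with a plain insertion sort instead of the median-based split of Section 4.1, so it trades the worst-case bound of the algorithm above for brevity. All names (Interval, refine, refine_all) are hypothetical; initial would contain the intervals that are still tied after the initial sorting step, and length starts at t_2.

#include <stddef.h>
#include <stdlib.h>

typedef struct { size_t begin, end; } Interval;

/* Refine one interval of suffixes tied for `length` symbols; still-tied
   runs are appended to next[], the new number of entries is returned. */
static size_t refine(size_t *pointer, size_t *index, size_t length,
                     size_t begin, size_t end,
                     Interval *next, size_t nnext)
{
    /* sort pointer[begin..end] by the doubled key (insertion sort) */
    for (size_t i = begin + 1; i <= end; i++) {
        size_t p = pointer[i], j = i;
        while (j > begin &&
               index[pointer[j - 1] + length] > index[p + length]) {
            pointer[j] = pointer[j - 1];
            j--;
        }
        pointer[j] = p;
    }

    /* snapshot the keys: index[pointer[i] + length] may be overwritten
       below if pointer[i] + length is itself a member of this interval */
    size_t m = end - begin + 1;
    size_t *key = malloc(m * sizeof *key);
    if (key == NULL)
        return nnext;                         /* defensive: give up here */
    for (size_t i = 0; i < m; i++)
        key[i] = index[pointer[begin + i] + length];

    /* new rank of every suffix = start of its run of equal keys; runs of
       more than one suffix stay tied for 2*length symbols and are queued */
    size_t run = 0;
    for (size_t i = 0; i < m; i++) {
        if (i > 0 && key[i] != key[i - 1]) {
            if (i - run > 1)
                next[nnext++] = (Interval){ begin + run, begin + i - 1 };
            run = i;
        }
        index[pointer[begin + i]] = begin + run;
    }
    if (m - run > 1)
        next[nnext++] = (Interval){ begin + run, begin + m - 1 };

    free(key);
    return nnext;
}

/* Alternate between two worklists, doubling `length` after every pass,
   until no tied interval is left. */
static void refine_all(size_t *pointer, size_t *index, size_t n,
                       const Interval *initial, size_t ninitial,
                       size_t length)
{
    size_t cap = n / 2 + 1;                   /* disjoint intervals of >= 2 */
    Interval *cur = malloc(cap * sizeof *cur);
    Interval *nxt = malloc(cap * sizeof *nxt);
    if (cur == NULL || nxt == NULL) { free(cur); free(nxt); return; }
    for (size_t i = 0; i < ninitial; i++)
        cur[i] = initial[i];
    size_t ncur = ninitial;

    while (ncur > 0) {
        size_t nnxt = 0;
        for (size_t w = 0; w < ncur; w++)
            nnxt = refine(pointer, index, length,
                          cur[w].begin, cur[w].end, nxt, nnxt);
        Interval *t = cur; cur = nxt; nxt = t;   /* swap the two lists */
        ncur = nnxt;
        length <<= 1;                            /* double the prefix  */
    }
    free(cur);
    free(nxt);
}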

References
1. B. Balkenhol and S. Kurtz, Space efficient linear time computation of the Burrows
and Wheeler transformation, Number, Information and Complexity, Special volume
in honour of R. Ahlswede on occasion of his 60th birthday, editors I. Althöfer, N.
Cai, G. Dueck, L. Khachatrian, M. Pinsker, A. Sárközy, I. Wegener, and Z. Zhang,
Kluwer Acad. Publ., Boston, Dordrecht, London, 375–384, 1999.
2. J. Bentley and R. Sedgewick, Fast algorithms for sorting and searching strings, Pro-
ceedings of the ACM–SIAM Symposium on Discrete Algorithms, 360–369, 1997.
3. M. Burrows and D.J. Wheeler, A block–sorting lossless data compression algorithm,
Technical report, Digital Systems Research Center, 1994.
4. N.J. Larsson, The context trees of block sorting compression, Proceedings of the
IEEE Data Compression Conference, Snowbird, Utah, March 30 – April 1, IEEE
Computer Society Press, 189–198, 1998.
5. N.J. Larsson, Structures of string matching and data compression, PhD thesis, Dept.
of Computer Science, Lund University, 1999.
6. N.J. Larsson and K. Sadakane, Faster suffix–sorting, Technical Report LU–CS–TR:
99–214, LUNDFD6/(NFCS–3140)/1–20/(1999), Dept. of Computer Science, Lund
University, 1999.
7. U. Manber and E.W. Myers, Suffix arrays: A new method for on–line string searches,
SIAM Journal on Computing, 22, 5, 935–948, 1993.
8. K. Sadakane, A fast algorithm for making suffix arrays and for Burrows–Wheeler
transformation, Proceedings of the IEEE Data Compression Conference, Snowbird,
Utah, March 30 – April 1, IEEE Computer Society Press, 129–138, 1998.
9. M. Schindler, A fast block–sorting algorithm for lossless data compression, Proceed-
ings of the Conference on Data Compression, 469, 1997.
