Lecture 02
We can define the lexicographical order using the concept of the longest
common prefix.
Definition 1.8: The length of the longest common prefix of two strings
A[0..m) and B[0..n), denoted by lcp(A, B), is the largest integer
ℓ ≤ min{m, n} such that A[0..ℓ) = B[0..ℓ).
Definition 1.9: Let A and B be two strings over an alphabet with a total
order ≤, and let ℓ = lcp(A, B). Then A is lexicographically smaller than or
equal to B, denoted by A ≤ B, if and only if
1. either |A| = ℓ
2. or |A| > ℓ, |B| > ℓ and A[ℓ] < B[ℓ].
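Definitions 1.8 and 1.9 translate directly into code. A minimal Python sketch (the function names lcp and leq are ours, not from the lecture):

    def lcp(A, B):
        # Length of the longest common prefix of A and B (Definition 1.8).
        ell = 0
        while ell < min(len(A), len(B)) and A[ell] == B[ell]:
            ell += 1
        return ell

    def leq(A, B):
        # Tests A <= B in lexicographical order (Definition 1.9).
        ell = lcp(A, B)
        return ell == len(A) or (ell < len(A) and ell < len(B) and A[ell] < B[ell])

    assert leq("ali", "alice") and not leq("anna", "alice")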
An important concept for sets of strings is the LCP (longest common
prefix) array and its sum.
Definition 1.10: Let R = {S1, S2, . . . , Sn} be a set of strings and assume
S1 < S2 < · · · < Sn. Then the LCP array LCP_R[1..n] is defined so that
LCP_R[1] = 0 and, for i ∈ [2..n],
LCP_R[i] = lcp(Si, Si−1) .
Furthermore, the LCP array sum is
ΣLCP(R) = ∑_{i∈[1..n]} LCP_R[i] .
Example 1.11: For R = {ali$, alice$, anna$, elias$, eliza$}, ΣLCP (R) = 7
and the LCP array is:
LCP_R[i]   Si
0          ali$
3          alice$
1          anna$
0          elias$
3          eliza$
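The LCP array of Example 1.11 can be verified with a few lines of Python (os.path.commonprefix computes the longest common prefix of a list of strings):

    from os.path import commonprefix

    R = sorted(["ali$", "alice$", "anna$", "elias$", "eliza$"])
    # LCP_R[1] = 0 and LCP_R[i] = lcp(S_i, S_{i-1}); Python indexing is 0-based.
    LCP = [0] + [len(commonprefix([R[i - 1], R[i]])) for i in range(1, len(R))]
    assert LCP == [0, 3, 1, 0, 3] and sum(LCP) == 7  # Example 1.11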
A variant of the LCP array sum is sometimes useful:
Definition 1.12: For a string S and a string set R, define
lcp(S, R) = max{lcp(S, T) | T ∈ R} ,
Σlcp(R) = ∑_{S∈R} lcp(S, R \ {S}) .
The relationship of the two measures is shown by the following two results:
Lemma 1.13: For i ∈ [2..n], LCP_R[i] = lcp(Si, {S1, . . . , Si−1}).
Lemma 1.14: ΣLCP (R) ≤ Σlcp(R) ≤ 2 · ΣLCP (R).
The proofs are left as an exercise.
The concept of distinguishing prefix is closely related and often used in place
of the longest common prefix for sets. The distinguishing prefix of a string
is the shortest prefix that separates it from the other strings in the set. It is
easy to see that dp(S, R \ {S}) = lcp(S, R \ {S}) + 1 (at least for a prefix-free R),
and correspondingly Σdp(R) = ∑_{S∈R} dp(S, R \ {S}) = Σlcp(R) + |R|.
Example 1.15: For R = {ali$, alice$, anna$, elias$, eliza$}, Σlcp(R) = 13
and Σdp(R) = 18.
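These measures are small enough to check by brute force. A minimal Python sketch for Example 1.15 and Lemma 1.14:

    from os.path import commonprefix

    def lcp_set(S, R):
        # lcp(S, R) = max{lcp(S, T) | T in R} (Definition 1.12).
        return max(len(commonprefix([S, T])) for T in R)

    R = {"ali$", "alice$", "anna$", "elias$", "eliza$"}
    sigma_lcp = sum(lcp_set(S, R - {S}) for S in R)
    sigma_dp = sigma_lcp + len(R)             # dp = lcp + 1 for each string
    assert sigma_lcp == 13 and sigma_dp == 18  # Example 1.15
    assert 7 <= sigma_lcp <= 2 * 7             # Lemma 1.14, with Sigma-LCP(R) = 7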
Theorem 1.16: The number of nodes in trie(R) is exactly
||R|| − ΣLCP (R) + 1, where ||R|| is the total length of the strings in R.
The proof reveals a close connection between LCP_R and the structure of
the trie. We will later see that LCP_R is useful as an actual data structure in
its own right.
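Theorem 1.16 can be checked without constructing a trie, since the nodes of trie(R) correspond exactly to the distinct prefixes of the strings in R (the root corresponds to the empty prefix). A small Python sketch for the example set:

    R = ["ali$", "alice$", "anna$", "elias$", "eliza$"]
    # Collect every prefix of every string, including the empty prefix.
    prefixes = {S[:i] for S in R for i in range(len(S) + 1)}
    total_length = sum(len(S) for S in R)         # ||R|| = 27
    assert len(prefixes) == total_length - 7 + 1  # Theorem 1.16, Sigma-LCP(R) = 7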
String Sorting
Ω(n log n) is a well-known lower bound for the number of comparisons
needed for sorting a set of n objects by any comparison-based algorithm.
This lower bound holds both in the worst case and in the average case.
There are many algorithms that match the lower bound, i.e., sort using
O(n log n) comparisons (worst or average case). Examples include quicksort,
heapsort and mergesort.
On the other hand, the average number of symbol comparisons needed to
compare two random strings is O(1). Does this mean that we can sort a set
of random strings in O(n log n) time using a standard sorting algorithm?
The following theorem shows that we cannot achieve O(n log n) symbol
comparisons for any set of strings (when σ = n^{o(1)}).
Theorem 1.17: Let A be an algorithm that sorts a set of objects using
only comparisons between the objects. Let R = {S1 , S2 , . . . , Sn } be a set of n
strings over an ordered alphabet Σ of size σ. Sorting R using A requires
Ω(n log n log_σ n) symbol comparisons on average, where the average is taken
over the initial orders of R.
• Note that the theorem holds for any comparison-based sorting algorithm
A and any string set R. In other words, we can choose A and R to
minimize the number of comparisons and still not get below the bound.
• The average is taken over the initial orders; the bound cannot hold for
every initial order. Otherwise, we could pick the correct order and use an
algorithm that first checks if the order is correct, needing only
O(n + ΣLCP(R)) symbol comparisons.
Proof of Theorem 1.17 (sketch). Let k = ⌊(log_σ n)/2⌋, so that σ^k ≤ √n.
For each α ∈ Σ^k, let R_α be the set of strings in R with the prefix α, and
let n_α = |R_α|.
• Since any two strings in R_α agree on their first k symbols, a single
string comparison between them requires at least k symbol comparisons.
• Thus A needs to do Ω(n_α log n_α) string comparisons and Ω(k n_α log n_α)
symbol comparisons to determine the relative order of the strings in R_α.
Hence the total number of symbol comparisons is Ω(∑_{α∈Σ^k} k n_α log n_α), and
∑_{α∈Σ^k} k n_α log n_α ≥ k(n − √n) log((n − √n)/σ^k)
≥ k(n − √n) log(√n − 1) = Ω(kn log n) = Ω(n log n log_σ n) .
Here we have used the facts that σ^k ≤ √n, that ∑_{α∈Σ^k} n_α > n − σ^k ≥ n − √n
(only the fewer than σ^k strings shorter than k lie outside the sets R_α),
and that, for a fixed value of ∑_{α∈Σ^k} n_α, the sum ∑_{α∈Σ^k} n_α log n_α
is minimized when the strings are distributed evenly among the σ^k sets R_α.
Theorem 1.18: Sorting a set R of n strings using only symbol comparisons
requires Ω(ΣLCP(R) + n log n) symbol comparisons.
Proof. If we are given the strings in the correct order and the job is to
verify that this is indeed so, we need at least ΣLCP(R) symbol
comparisons. No sorting algorithm could possibly do its job with fewer
symbol comparisons. This gives the lower bound Ω(ΣLCP(R)).
On the other hand, the general sorting lower bound Ω(n log n) must hold
here too.
• Note that the expected value of ΣLCP(R) for a random set of n
strings is O(n log_σ n). The lower bound then becomes Ω(n log n).
We will next see that there are algorithms that match this lower bound.
Such algorithms can sort a random set of strings in O(n log n) time.
String Quicksort (Multikey Quicksort)
Here is a variant of quicksort that partitions the input into three parts
instead of the usual two parts.
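The following is a minimal Python sketch of such a ternary quicksort (a recursive, out-of-place rendering; the structure and names are ours):

    import random

    def ternary_quicksort(R):
        # Partition into elements smaller than, equal to, and greater than the pivot.
        if len(R) <= 1:
            return R
        pivot = random.choice(R)
        R_less = [x for x in R if x < pivot]
        R_equal = [x for x in R if x == pivot]
        R_greater = [x for x in R if x > pivot]
        return ternary_quicksort(R_less) + R_equal + ternary_quicksort(R_greater)

The advantage over a binary partition is that the elements equal to the pivot are finished immediately and never enter a recursive call.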
In the normal, binary quicksort, we would have two subsets R≤ and R≥ , both
of which may contain elements that are equal to the pivot.
The time complexity of both the binary and the ternary quicksort depends
on the selection of the pivot (exercise).
String quicksort is similar to ternary quicksort, but it partitions using a single
character position. String quicksort is also known as multikey quicksort.
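A minimal Python sketch in the same style as the ternary quicksort above (ours; ell is the length of the common prefix of the current subset):

    import random

    def string_quicksort(R, ell=0):
        # Multikey quicksort: partition by the single symbol at position ell.
        if len(R) <= 1:
            return R
        # Strings of length ell are identical to the common prefix and hence
        # smaller than, and already separated from, the rest.
        R_short = [S for S in R if len(S) == ell]
        R = [S for S in R if len(S) > ell]
        if not R:
            return R_short
        pivot = random.choice(R)[ell]
        R_less = [S for S in R if S[ell] < pivot]
        R_equal = [S for S in R if S[ell] == pivot]
        R_greater = [S for S in R if S[ell] > pivot]
        return (R_short + string_quicksort(R_less, ell)
                + string_quicksort(R_equal, ell + 1)
                + string_quicksort(R_greater, ell))

    assert string_quicksort(["banana", "ban", "apple"]) == ["apple", "ban", "banana"]

Note that only the middle part advances to position ell + 1: its symbols at position ell are known to be equal and need not be compared again.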
Example 1.21: A possible partitioning, when ℓ = 2 (the symbol at position 2
is set off; the pivot symbol is l):

al p habet          al i gnment
al i gnment         al g orithm
al l ocate          al i as
al g orithm   =⇒    al l ocate
al t ernative       al l
al i as             al p habet
al t ernate         al t ernative
al l                al t ernate
Theorem 1.22: String quicksort sorts a set R of n strings in
O(ΣLCP(R) + n log n) time (with a suitable pivot selection, e.g., the median).
Proof of Theorem 1.22. The time complexity is dominated by the symbol
comparisons against the pivot symbol during partitioning. We charge the cost
of each comparison either on a single symbol or on a string depending on the
result of the comparison:
• If the result is an inequality, we charge it on the string. The string then
moves to a part of at most half the size (with median pivots), so each
string is charged O(log n) times, for a total of O(n log n).
• If the result is an equality, we charge it on the symbol. The string then
moves to the middle part, where the comparison position advances to ℓ + 1,
so each symbol is charged at most once. Only the symbols in the
distinguishing prefixes are ever compared, for a total of
O(Σdp(R)) = O(ΣLCP(R) + n).
Radix Sort
The Ω(n log n) sorting lower bound does not apply to algorithms that use
stronger operations than comparisons. A basic example is counting sort for
sorting integers.
Algorithm 1.23: CountingSort(R)
Input: (Multi)set R = {k1, k2, . . . , kn} of integers from the range [0..σ).
Output: R in nondecreasing order in array J[0..n).
(1) for i ← 0 to σ − 1 do C[i] ← 0
(2) for i ← 1 to n do C[ki ] ← C[ki ] + 1
(3) sum ← 0
(4) for i ← 0 to σ − 1 do // cumulative sums
(5) tmp ← C[i]; C[i] ← sum; sum ← sum + tmp
(6) for i ← 1 to n do // distribute
(7) J[C[ki ]] ← ki ; C[ki ] ← C[ki ] + 1
(8) return J
Similarly, the Ω(ΣLCP (R) + n log n) lower bound does not apply to string
sorting algorithms that use stronger operations than symbol comparisons.
Radix sort is such an algorithm for integer alphabets.
Radix sort was developed for sorting large integers, but it treats an integer
as a string of digits, so it is really a string sorting algorithm.
• MSD radix sort starts sorting from the beginning of the strings (most
significant digit first).
• LSD radix sort starts sorting from the end of the strings (least
significant digit first).
The LSD radix sort algorithm is very simple.
Algorithm 1.24: LSDRadixSort(R)
Input: (Multi)set R = {S1 , S2 , . . . , Sn } of strings of length m over alphabet [0..σ).
Output: R in ascending lexicographical order.
(1) for ℓ ← m − 1 downto 0 do CountingSort(R, ℓ)
(2) return R
It is easy to show that after i rounds, the strings are sorted by their
suffixes of length i: counting sort is stable, so ties at the current position
are resolved according to the order produced by the previous rounds. Thus
the strings are fully sorted at the end.
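A minimal Python sketch for equal-length strings (the helper counting_sort_by is our rendering of Algorithm 1.23, sorting by the symbol at position ell; a byte-sized alphabet is assumed):

    def counting_sort_by(R, ell, sigma=256):
        # Stable counting sort of the strings in R by the symbol at position ell.
        C = [0] * sigma
        for S in R:
            C[ord(S[ell])] += 1
        total = 0
        for c in range(sigma):               # cumulative sums
            C[c], total = total, total + C[c]
        J = [None] * len(R)
        for S in R:                          # distribute, preserving input order
            J[C[ord(S[ell])]] = S
            C[ord(S[ell])] += 1
        return J

    def lsd_radix_sort(R, m):
        # Sort strings of common length m, last position first.
        for ell in range(m - 1, -1, -1):
            R = counting_sort_by(R, ell)
        return R

    assert lsd_radix_sort(["bca", "abc", "bac", "aab"], 3) == ["aab", "abc", "bac", "bca"]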
The algorithm assumes that all strings have the same length m, but it can
be modified to handle strings of different lengths (exercise).
Theorem 1.26: LSD radix sort sorts a set R of strings over the alphabet
[0..σ) in O(||R|| + mσ) time, where ||R|| is the total length of the strings in
R and m is the length of the longest string in R.
Proof. Assume all strings have length m. The LSD radix sort performs m
rounds with each round taking O(n + σ) time. The total time is
O(mn + mσ) = O(||R|| + mσ).
• The weakness of LSD radix sort is that it uses Ω(||R||) time even when
ΣLCP (R) is much smaller than ||R||.
MSD radix sort resembles string quicksort but partitions the strings into σ
parts instead of three parts.
Algorithm 1.28: MSDRadixSort(R, ℓ)
Input: (Multi)set R = {S1, S2, . . . , Sn} of strings over the alphabet [0..σ)
and the length ℓ of their common prefix.
Output: R in ascending lexicographical order.
(1) if |R| < σ then return StringQuicksort(R, ℓ)
(2) R⊥ ← {S ∈ R | |S| = ℓ}; R ← R \ R⊥
(3) (R0, R1, . . . , Rσ−1) ← CountingSort(R, ℓ)
(4) for i ← 0 to σ − 1 do Ri ← MSDRadixSort(Ri, ℓ + 1)
(5) return R⊥ · R0 · R1 · · · Rσ−1
• Here CountingSort(R, ℓ) not only sorts but also returns the partitioning
based on the symbols at position ℓ. The time complexity is still O(|R| + σ).
• The recursive calls eventually lead to a large number of very small sets,
but counting sort needs Ω(σ) time no matter how small the set is. To
avoid the potentially high cost, the algorithm switches to string
quicksort for small sets.
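A minimal Python sketch of the same scheme (ours; it reuses string_quicksort from the earlier sketch and replaces the in-place counting sort of Algorithm 1.28 with bucket lists):

    def msd_radix_sort(R, ell=0, sigma=256):
        # Sort strings sharing a common prefix of length ell.
        if len(R) < sigma:
            return string_quicksort(R, ell)       # small set: switch algorithms
        R_done = [S for S in R if len(S) == ell]  # strings ending at position ell
        R = [S for S in R if len(S) > ell]
        buckets = [[] for _ in range(sigma)]      # partition by symbol at position ell
        for S in R:
            buckets[ord(S[ell])].append(S)
        result = R_done
        for bucket in buckets:
            if bucket:
                result += msd_radix_sort(bucket, ell + 1, sigma)
        return result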
Theorem 1.29: MSD radix sort sorts a set R of n strings over the
alphabet [0..σ) in O(ΣLCP (R) + n log σ) time.
Proof. Consider a call processing a subset of size k ≥ σ:
• The time excluding the recursive calls but including the call to counting
sort is O(k + σ) = O(k). The k symbols accessed here will not be
accessed again.
• At most dp(S, R \ {S}) ≤ lcp(S, R \ {S}) + 1 symbols of S will be
accessed by the algorithm. Thus the total time spent in calls of this
kind is O(Σdp(R)) = O(Σlcp(R) + n) = O(ΣLCP(R) + n).
The calls for subsets of size k < σ are handled by string quicksort. Each
string is involved in at most one such call. Therefore, the total time over all
calls to string quicksort is O(ΣLCP(R) + n log σ).
• There exists a more complicated variant of MSD radix sort with time
complexity O(ΣLCP (R) + n + σ).
• Ω(ΣLCP (R) + n) is a lower bound for any algorithm that must access
symbols one at a time (simple string model).
• In practice, MSD radix sort is very fast, but it is sensitive to
implementation details.