Lecture 03
• The return value is the pair (x, ℓ), where x ∈ {<, =, >} indicates the
order, and ℓ = lcp(A, B), the length of the longest common prefix of
strings A and B.
• The input value k is the length of a known common prefix, i.e., a lower
bound on lcp(A, B). The comparison can skip the first k characters.
The extra time spent in the comparison is balanced by the extra information
obtained in the form of the lcp value.
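The LcpCompare routine itself is not spelled out here; a minimal Python sketch under the above conventions (the name lcp_compare and the tuple return convention are illustrative) could be:

```python
def lcp_compare(A, B, k=0):
    """lcp-comparison: returns (x, l) where x in {'<', '=', '>'} gives the
    order of A and B, and l = lcp(A, B). The first k characters are
    skipped, k being a known lower bound on lcp(A, B)."""
    i = k
    while i < len(A) and i < len(B) and A[i] == B[i]:
        i += 1
    if i < len(A) and i < len(B):
        return ('<' if A[i] < B[i] else '>', i)
    if len(A) == len(B):
        return ('=', i)
    # a proper prefix is lexicographically smaller
    return ('<', i) if len(A) < len(B) else ('>', i)
```

The extra work compared to a plain comparison is only the bookkeeping of i, which is returned as the lcp value.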
The following result shows how we can use the information from earlier
comparisons to obtain a lower bound or even the exact value for an lcp.
Lemma 1.30: Let A, B and C be strings such that A ≤ B ≤ C. Then
lcp(A, C) = min{lcp(A, B), lcp(B, C)}.
Intuitively, the above result makes sense if you think of lcp(·, ·) as a measure
of similarity between two strings. The higher the lcp, the closer the two
strings are lexicographically.
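Lemma 1.30 states that if A ≤ B ≤ C then lcp(A, C) = min{lcp(A, B), lcp(B, C)}; a quick numerical check in Python (the helper lcp and the example strings are illustrative):

```python
def lcp(A, B):
    """Length of the longest common prefix of A and B."""
    i = 0
    while i < len(A) and i < len(B) and A[i] == B[i]:
        i += 1
    return i

# Three strings in lexicographic order A <= B <= C:
A, B, C = "aardvark", "abacus", "abbey"
assert A <= B <= C
# Lemma 1.30: lcp(A, C) = min{lcp(A, B), lcp(B, C)}
assert lcp(A, C) == min(lcp(A, B), lcp(B, C))   # 1 == min(1, 2)
```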
String Mergesort
String mergesort is a string sorting algorithm that uses lcp-comparisons. It
has the same structure as the standard mergesort: sort the first half and the
second half separately, and then merge the results.
Algorithm 1.32: StringMergesort(R)
Input: Set R = {S1 , S2 , . . . , Sn } of strings.
Output: R sorted and augmented with LCPR values.
(1) if |R| = 1 then return ((S1 , 0))
(2) m ← ⌊n/2⌋
(3) P ← StringMergesort({S1 , S2 , . . . , Sm })
(4) Q ← StringMergesort({Sm+1 , Sm+2 , . . . , Sn })
(5) return StringMerge(P, Q)
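StringMerge (Algorithm 1.33) is not reproduced above. Assuming it maintains, for the head of each list, the lcp with the most recently output string (updating it with the value returned by LcpCompare), the whole sort can be sketched in Python as follows; all names are illustrative:

```python
def lcp_compare(A, B, k=0):
    """lcp-comparison skipping a known common prefix of length k."""
    i = k
    while i < len(A) and i < len(B) and A[i] == B[i]:
        i += 1
    if i < len(A) and i < len(B):
        return ('<' if A[i] < B[i] else '>', i)
    if len(A) == len(B):
        return ('=', i)
    return ('<', i) if len(A) < len(B) else ('>', i)

def string_merge(P, Q):
    """Merge two sorted lists of (string, lcp) pairs. Invariant: the
    stored lcp of each head is its lcp with the most recently output
    string (0 before anything is output)."""
    R = []
    P, Q = list(P), list(Q)
    while P and Q:
        (S, k), (T, h) = P[0], Q[0]
        if k > h:       # S shares a longer prefix with the last output, so S < T
            R.append(P.pop(0))
        elif h > k:
            R.append(Q.pop(0))
        else:           # k == h: one lcp-comparison decides
            x, l = lcp_compare(S, T, k)
            if x != '>':
                R.append(P.pop(0))
                Q[0] = (T, l)   # l = lcp(S, T): T's lcp with the new last output
            else:
                R.append(Q.pop(0))
                P[0] = (S, l)
    return R + P + Q

def string_mergesort(strings):
    """Sort strings; returns (string, lcp with previous string) pairs."""
    if len(strings) == 1:
        return [(strings[0], 0)]
    m = len(strings) // 2
    return string_merge(string_mergesort(strings[:m]),
                        string_mergesort(strings[m:]))
```

The list-based pop(0) is linear time and used only for clarity; a collections.deque would make each removal constant time.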
Lemma 1.34: StringMerge performs the merging correctly.
Proof. We will show that the following invariant holds at the beginning of
each round in the loop on lines (2)–(12):
The invariant is clearly true in the beginning. We will show that the invariant
is maintained and the smaller string is chosen in each round of the loop.
Proof. If the calls to LcpCompare took constant time, the time complexity
would be O(n log n) by the same argument as with the standard mergesort.
String Binary Search
We can use the lcp-comparison technique to improve binary search for
strings. The following is a key result.
During the binary search of P in {S1 , S2 , . . . , Sn }, the basic situation is the
following:
• We have already compared P against Sleft and Sright, and we know that
Sleft ≤ P ≤ Sright.
Algorithm 1.38: String binary search (without precomputed lcps)
Input: Ordered string set R = {S1 , S2 , . . . , Sn }, query string P .
Output: The number of strings in R that are smaller than P .
(1) left ← 0; right ← n + 1
(2) llcp ← 0 // llcp = lcp(Sleft, P)
(3) rlcp ← 0 // rlcp = lcp(P, Sright)
(4) while right − left > 1 do
(5) mid ← ⌊(left + right)/2⌋
(6) mlcp ← min{llcp, rlcp}
(7) (x, mlcp) ← LcpCompare(P, Smid, mlcp)
(8) if x = “<” then right ← mid; rlcp ← mlcp
(9) else left ← mid; llcp ← mlcp
(10) return left
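A direct Python transcription of Algorithm 1.38 might look as follows; R is a 0-indexed sorted list, the positions left/right follow the 1-based convention with implicit sentinels S_0 (smaller than everything) and S_{n+1} (larger than everything), and lcp_compare is the lcp-comparison routine from earlier:

```python
def lcp_compare(A, B, k=0):
    """lcp-comparison skipping a known common prefix of length k."""
    i = k
    while i < len(A) and i < len(B) and A[i] == B[i]:
        i += 1
    if i < len(A) and i < len(B):
        return ('<' if A[i] < B[i] else '>', i)
    if len(A) == len(B):
        return ('=', i)
    return ('<', i) if len(A) < len(B) else ('>', i)

def string_binary_search(R, P):
    """Number of strings in the sorted list R that precede P
    (Algorithm 1.38 without precomputed lcps)."""
    left, right = 0, len(R) + 1
    llcp = rlcp = 0                # lcp(S_left, P) and lcp(P, S_right)
    while right - left > 1:
        mid = (left + right) // 2
        # lcp(P, S_mid) >= min(llcp, rlcp), so that many characters
        # can be skipped in the comparison
        x, mlcp = lcp_compare(P, R[mid - 1], min(llcp, rlcp))
        if x == '<':
            right, rlcp = mid, mlcp
        else:
            left, llcp = mid, mlcp
    return left
```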
We can improve the worst case complexity by choosing the midpoint closer
to the smaller lcp value:
• If llcp − rlcp > 1, choose the middle position closer to the right.
• If rlcp − llcp > 1, choose the middle position closer to the left in a
symmetric way.
• The lower bound for string binary searching time (with m = |P |) has
been shown to be
Θ( (m log log n) / (log log((m log log n)/log n)) + m + log n ).
Algorithm 1.39: Skewed string binary search (without precomputed lcps)
Input: Ordered string set R = {S1 , S2 , . . . , Sn }, query string P .
Output: The number of strings in R that are smaller than P .
(1) left ← 0; right ← n + 1
(2) llcp ← 0 // llcp = lcp(Sleft, P)
(3) rlcp ← 0 // rlcp = lcp(P, Sright)
(4) while right − left > 1 do
(5) if llcp − rlcp > 1 then
(6) d ← llcp − rlcp
(7) mid ← ⌈((ln(d + 1)) · left + d · right)/(d + ln(d + 1))⌉
(8) else if rlcp − llcp > 1 then
(9) d ← rlcp − llcp
(10) mid ← ⌊(d · left + (ln(d + 1)) · right)/(d + ln(d + 1))⌋
(11) else
(12) mid ← ⌊(left + right)/2⌋
(13) mlcp ← min{llcp, rlcp}
(14) (x, mlcp) ← LcpCompare(P, Smid, mlcp)
(15) if x = “<” then right ← mid; rlcp ← mlcp
(16) else left ← mid; llcp ← mlcp
(17) return left
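The skewed variant can be sketched in Python as follows. One detail is added as an assumption: the rounded midpoint formulas can land on the interval endpoints, so a clamp keeping mid strictly between left and right is included as a safety guard to guarantee termination.

```python
import math

def lcp_compare(A, B, k=0):
    """lcp-comparison skipping a known common prefix of length k."""
    i = k
    while i < len(A) and i < len(B) and A[i] == B[i]:
        i += 1
    if i < len(A) and i < len(B):
        return ('<' if A[i] < B[i] else '>', i)
    if len(A) == len(B):
        return ('=', i)
    return ('<', i) if len(A) < len(B) else ('>', i)

def skewed_string_binary_search(R, P):
    """Number of strings in the sorted list R that precede P,
    with the midpoint skewed toward the smaller lcp value."""
    left, right = 0, len(R) + 1
    llcp = rlcp = 0
    while right - left > 1:
        if llcp - rlcp > 1:        # P close to S_left: probe nearer right
            d = llcp - rlcp
            w = math.log(d + 1)
            mid = math.ceil((w * left + d * right) / (d + w))
        elif rlcp - llcp > 1:      # P close to S_right: probe nearer left
            d = rlcp - llcp
            w = math.log(d + 1)
            mid = math.floor((d * left + w * right) / (d + w))
        else:
            mid = (left + right) // 2
        mid = max(left + 1, min(right - 1, mid))  # safety clamp (assumption)
        x, mlcp = lcp_compare(P, R[mid - 1], min(llcp, rlcp))
        if x == '<':
            right, rlcp = mid, mlcp
        else:
            left, llcp = mid, mlcp
    return left
```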
The lower bound above assumes that no other information besides the
ordering of the strings is given. We can further improve string binary
searching by using precomputed information about the lcp’s between the
strings in R.
Consider again the basic situation during string binary search:
In the unskewed algorithm, the values left and right are fully determined by
mid independently of P . That is, P only determines whether the search ends
up at position mid at all, but if it does, left and right are always the same.
Thus, we can precompute and store the values
LLCP [mid] = lcp(Sleft, Smid)
RLCP [mid] = lcp(Smid, Sright)
Now we know all lcp values between P , Sleft, Smid and Sright except lcp(P, Smid).
The following lemma shows how to utilize this.
Lemma 1.40: Let A, B, B′ and C be strings such that A ≤ B ≤ C and
A ≤ B′ ≤ C.
Proof. Cases (a)–(d) are symmetrical; we show (a), where
lcp(A, B) > lcp(A, B′). B < B′ follows from Lemma 1.31. Then by
Lemma 1.30, lcp(A, B′) = min{lcp(A, B), lcp(B, B′)}. Since
lcp(A, B′) < lcp(A, B), we must have lcp(A, B′) = lcp(B, B′).
Theorem 1.42: An ordered string set R = {S1 , S2 , . . . , Sn } can be
preprocessed in O(ΣLCP (R) + n) time and O(n) space so that a binary
search with a query string P can be executed in O(|P | + log n) time.
Proof. The values LLCP [mid] and RLCP [mid] can be computed in
O(lcp(Smid , R \ {Smid }) + 1) time. Thus the arrays LLCP and RLCP can be
computed in O(Σlcp(R) + n) = O(ΣLCP (R) + n) time and stored in O(n)
space.
The main while loop in Algorithm 1.41 is executed O(log n) times and
everything except LcpCompare on line (11) needs constant time.
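Algorithm 1.41 itself is not reproduced above. The following Python sketch shows one way the precomputed LLCP and RLCP arrays can drive the search via the case analysis of Lemma 1.40; the function names and the 1-based position convention with empty sentinels are illustrative:

```python
def lcp(A, B):
    """Length of the longest common prefix of A and B."""
    i = 0
    while i < len(A) and i < len(B) and A[i] == B[i]:
        i += 1
    return i

def lcp_compare(A, B, k=0):
    """lcp-comparison skipping a known common prefix of length k."""
    i = k
    while i < len(A) and i < len(B) and A[i] == B[i]:
        i += 1
    if i < len(A) and i < len(B):
        return ('<' if A[i] < B[i] else '>', i)
    if len(A) == len(B):
        return ('=', i)
    return ('<', i) if len(A) < len(B) else ('>', i)

def precompute_lcps(R):
    """Fill LLCP[mid] = lcp(S_left, S_mid) and RLCP[mid] = lcp(S_mid, S_right).
    left and right depend only on mid, so the whole binary search tree
    can be filled recursively; sentinels S_0 and S_{n+1} have lcp 0."""
    n = len(R)
    LLCP = [0] * (n + 2)
    RLCP = [0] * (n + 2)
    def fill(left, right):
        if right - left <= 1:
            return
        mid = (left + right) // 2
        LLCP[mid] = lcp(R[left - 1], R[mid - 1]) if left >= 1 else 0
        RLCP[mid] = lcp(R[mid - 1], R[right - 1]) if right <= n else 0
        fill(left, mid)
        fill(mid, right)
    fill(0, n + 1)
    return LLCP, RLCP

def string_binary_search_lcps(R, LLCP, RLCP, P):
    """Search driven by the precomputed lcps; LcpCompare is reached only
    when LLCP[mid] = llcp and RLCP[mid] = rlcp, the other cases being
    decided by Lemma 1.40 without looking at P at all."""
    left, right = 0, len(R) + 1
    llcp = rlcp = 0
    while right - left > 1:
        mid = (left + right) // 2
        if LLCP[mid] > llcp:       # S_mid closer to S_left than P: S_mid < P
            left = mid
        elif LLCP[mid] < llcp:     # P closer to S_left: P < S_mid
            right, rlcp = mid, LLCP[mid]
        elif RLCP[mid] > rlcp:     # S_mid closer to S_right: P < S_mid
            right = mid
        elif RLCP[mid] < rlcp:     # P closer to S_right: S_mid < P
            left, llcp = mid, RLCP[mid]
        else:                      # lcp(P, S_mid) >= max(llcp, rlcp)
            x, mlcp = lcp_compare(P, R[mid - 1], max(llcp, rlcp))
            if x == '<':
                right, rlcp = mid, mlcp
            else:
                left, llcp = mid, mlcp
    return left
```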
Hashing and Fingerprints
Hashing is a powerful technique for dealing with strings based on mapping
each string to an integer using a hash function:
H : Σ∗ → [0..q) ⊂ N
The most common use of hashing is with hash tables. Hash tables come in
many flavors that can be used with strings as well as with any other type of
object with an appropriate hash function. A drawback of using a hash table
to store a set of strings is that they do not support lcp and prefix queries.
Hashing is also used in other situations, where one needs to check whether
two strings S and T are the same or not:
When used this way, the hash value is often called a fingerprint, and its
range [0..q) is typically large as it is not restricted by a hash table size.
Any good hash function must depend on all characters. Thus computing
H(S) needs Ω(|S|) time, which can defeat the advantages of hashing:
• The main strength of hash tables is the support for constant time
insertions and queries, but for example inserting a string S into a hash
table needs Ω(|S|) time when the hash computation time is included.
Compare this to the O(|S|) time for a trie under a constant alphabet
and the O(|S| + log n) time for a ternary trie.
• A hash value can be computed once, stored, and used many times.
• Some hash functions can be computed more efficiently for a related set
of strings. An example is the Karp–Rabin hash function.
Definition 1.43: The Karp–Rabin hash function for a string
S = s0 s1 . . . sm−1 over an integer alphabet is
H(S) = (s0 r^(m−1) + s1 r^(m−2) + · · · + sm−2 r + sm−1) mod q
for some fixed positive integers q and r.
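The polynomial can be evaluated with Horner's rule, giving a one-pass computation. A minimal Python version (the function name is illustrative; characters are taken as code points):

```python
def karp_rabin(S, q, r):
    """Karp-Rabin hash H(S) = (s0 r^(m-1) + ... + s_{m-1}) mod q,
    computed with Horner's rule in one left-to-right pass."""
    h = 0
    for c in S:
        h = (h * r + ord(c)) % q
    return h
```

For example, karp_rabin("ab", 10**9 + 7, 256) is (97 · 256 + 98) mod (10⁹ + 7) = 24930.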
Lemma 1.44: For any two strings A and B,
H(AB) = (H(A) · r^|B| + H(B)) mod q
H(B) = (H(AB) − H(A) · r^|B|) mod q
Proof. Without the modulo operation, the result would be obvious. The
modulo does not interfere because of the rules of modular arithmetic:
(x + y) mod q = ((x mod q) + (y mod q)) mod q
(xy) mod q = ((x mod q)(y mod q)) mod q
Thus we can quickly compute H(AB) from H(A) and H(B), and H(B) from
H(AB) and H(A). We will see applications of this later.
If q and r are coprime, then r has a multiplicative inverse r^(−1) modulo q, and
we can also compute H(A) = ((H(AB) − H(B)) · (r^(−1))^|B|) mod q.
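These identities are easy to check numerically. The sketch below picks an illustrative Mersenne prime q and r = 256 (coprime, since q is odd); Python's three-argument pow gives modular exponentiation, and pow(r, -1, q) gives the modular inverse (Python 3.8+):

```python
q, r = 2**61 - 1, 256   # illustrative: q a Mersenne prime, gcd(q, r) = 1

def H(S):
    """Karp-Rabin hash via Horner's rule."""
    h = 0
    for c in S:
        h = (h * r + ord(c)) % q
    return h

A, B = "karp", "rabin"
rB = pow(r, len(B), q)                       # r^|B| mod q
assert H(A + B) == (H(A) * rB + H(B)) % q    # hash of a concatenation
assert H(B) == (H(A + B) - H(A) * rB) % q    # hash of a suffix
# q is prime and coprime to r, so r has an inverse modulo q:
r_inv_B = pow(pow(r, -1, q), len(B), q)      # (r^(-1))^|B| mod q
assert H(A) == ((H(A + B) - H(B)) * r_inv_B) % q  # hash of a prefix
```

Python's % operator always returns a non-negative result for a positive modulus, so the subtractions need no extra adjustment.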
The parameters q and r have to be chosen with some care to ensure that
collisions are rare for any reasonable set of strings.
• If q and r were both powers of two, then only the last ⌈(log q)/log r⌉
characters of the string would affect the hash value. More generally, q
and r should be coprime, i.e., have no common divisors other than 1.
Automata support set inclusion testing but not other trie operations:
• No insertions and deletions
• No satellite data, i.e., data associated to each string
Sets of Strings: Summary
• Storing and searching: trie and ternary trie and their compact versions,
string binary search, Karp–Rabin hashing.
• Sorting: string quicksort and mergesort, LSD and MSD radix sort.
Lower bounds:
Selected Literature
• Trie:
Fredkin: Trie Memory. Communications of the ACM. 3(9),
1960, pp. 490–499.
• Compact trie:
Morrison: PATRICIA—Practical Algorithm To Retrieve
Information Coded in Alphanumeric. Journal of the ACM, 15(4),
1968, pp. 514–534.
• String mergesort:
Ng & Kakehi: Merging String Sequences by Longest Common
Prefixes. IPSJ Journal, 49(2), 2008, pp. 958–967.
• Karp–Rabin hashing:
Karp & Rabin: Efficient randomized pattern-matching
algorithms. IBM Journal of Research and Development, 31(2),
1987, pp. 249–260.