Lecture 03


Lcp-Comparisons

General (non-string) comparison-based sorting algorithms are not optimal
for sorting strings because of an imbalance between effort and result in a
string comparison: it can take a lot of time, but the result is only a bit or a
trit of useful information.

String quicksort solves this problem by processing the obtained information
immediately after each symbol comparison.
An opposite approach is to replace a standard string comparison with an
lcp-comparison, which is the operation LcpCompare(A, B, k):

• The return value is the pair (x, ℓ), where x ∈ {<, =, >} indicates the
order, and ℓ = lcp(A, B), the length of the longest common prefix of
strings A and B.

• The input value k is the length of a known common prefix, i.e., a lower
bound on lcp(A, B). The comparison can skip the first k characters.

The extra time spent in the comparison is balanced by the extra information
obtained in the form of the lcp value.
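The operation can be sketched in Python as follows (a straightforward scan; the function name and the tuple return convention are illustrative, not fixed by the lecture):

```python
def lcp_compare(A, B, k=0):
    """LcpCompare(A, B, k): compare A and B, skipping the first k
    characters, which are assumed to form a known common prefix.
    Returns (x, l) with x in {'<', '=', '>'} and l = lcp(A, B)."""
    i = k
    while i < len(A) and i < len(B) and A[i] == B[i]:
        i += 1                      # extend the common prefix
    if i < len(A) and i < len(B):
        return ('<' if A[i] < B[i] else '>'), i
    if len(A) == len(B):
        return '=', i               # the strings are equal
    # one string is a proper prefix of the other; the shorter is smaller
    return ('<' if len(A) < len(B) else '>'), i
```

With k > 0 the scan starts at position k, so the comparison cost is proportional to the new characters inspected, not to the whole strings.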
The following result shows how we can use the information from earlier
comparisons to obtain a lower bound or even the exact value for an lcp.
Lemma 1.30: Let A, B and C be strings.

(a) lcp(A, C) ≥ min{lcp(A, B), lcp(B, C)}.

(b) If A ≤ B ≤ C, then lcp(A, C) = min{lcp(A, B), lcp(B, C)}.

(c) If lcp(A, B) ≠ lcp(B, C), then lcp(A, C) = min{lcp(A, B), lcp(B, C)}.

Proof. Assume ℓ = lcp(A, B) ≤ lcp(B, C). The opposite case
lcp(A, B) ≥ lcp(B, C) is symmetric.

(a) Now A[0..ℓ) = B[0..ℓ) = C[0..ℓ) and thus lcp(A, C) ≥ ℓ.

(b) Either |A| = ℓ or A[ℓ] < B[ℓ] ≤ C[ℓ]. In either case, lcp(A, C) = ℓ.

(c) Now lcp(A, B) < lcp(B, C). If lcp(A, C) > min{lcp(A, B), lcp(B, C)}, then
lcp(A, B) < min{lcp(A, C), lcp(B, C)}, which violates (a). □

The above means that the three lcp values between three strings can never
be three different values. At least two of them are the same and the third
one is the same or bigger.
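This claim is easy to check exhaustively for short binary strings; the following sketch (helper names are illustrative) verifies that the minimum of the three pairwise lcp values always occurs at least twice:

```python
from itertools import product

def lcp(A, B):
    """Length of the longest common prefix of A and B."""
    i = 0
    while i < len(A) and i < len(B) and A[i] == B[i]:
        i += 1
    return i

# All binary strings of length < 4, and all ordered triples of them.
strings = [''.join(p) for n in range(4) for p in product('ab', repeat=n)]
for A, B, C in product(strings, repeat=3):
    values = sorted([lcp(A, B), lcp(B, C), lcp(A, C)])
    # the two smallest of the three lcp values must be equal
    assert values[0] == values[1]
```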
It is sometimes possible to determine the order of two strings without
comparing them directly.

Lemma 1.31: Let A, B, B′ and C be strings such that A ≤ B ≤ C and
A ≤ B′ ≤ C.

(a) If lcp(A, B) > lcp(A, B′), then B < B′.

(b) If lcp(B, C) > lcp(B′, C), then B > B′.

Proof. We show (a); (b) is symmetric. Assume to the contrary that B ≥ B′.
Then by Lemma 1.30, lcp(A, B) = min{lcp(A, B′), lcp(B′, B)} ≤ lcp(A, B′),
which is a contradiction. □

Intuitively, the above result makes sense if you think of lcp(·, ·) as a measure
of similarity between two strings. The higher the lcp, the closer the two
strings are lexicographically.

String Mergesort
String mergesort is a string sorting algorithm that uses lcp-comparisons. It
has the same structure as the standard mergesort: sort the first half and the
second half separately, and then merge the results.
Algorithm 1.32: StringMergesort(R)
Input: Set R = {S1 , S2 , . . . , Sn } of strings.
Output: R sorted and augmented with LCPR values.
(1) if |R| = 1 then return ((S1 , 0))
(2) m ← ⌊n/2⌋
(3) P ← StringMergesort({S1 , S2 , . . . , Sm })
(4) Q ← StringMergesort({Sm+1 , Sm+2 , . . . , Sn })
(5) return StringMerge(P, Q)

The output is of the form

((T1, ℓ1), (T2, ℓ2), . . . , (Tn, ℓn))

where ℓi = lcp(Ti, Ti−1) for i > 1 and ℓ1 = 0. In other words, ℓi = LCPR[i].
Thus we get not only the order of the strings but also a lot of information
about their common prefixes. The procedure StringMerge uses this
information effectively.
Algorithm 1.33: StringMerge(P, Q)
Input: Sequences P = ((S1, k1), . . . , (Sm, km)) and Q = ((T1, ℓ1), . . . , (Tn, ℓn))
Output: Merged sequence R
(1) R ← ∅; i ← 1; j ← 1
(2) while i ≤ m and j ≤ n do
(3) if ki > ℓj then append (Si, ki) to R; i ← i + 1
(4) else if ℓj > ki then append (Tj, ℓj) to R; j ← j + 1
(5) else // ki = ℓj
(6) (x, h) ← LcpCompare(Si, Tj, ki)
(7) if x = "<" then
(8) append (Si, ki) to R; i ← i + 1
(9) ℓj ← h
(10) else
(11) append (Tj, ℓj) to R; j ← j + 1
(12) ki ← h
(13) while i ≤ m do append (Si, ki) to R; i ← i + 1
(14) while j ≤ n do append (Tj, ℓj) to R; j ← j + 1
(15) return R
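A direct Python transcription of Algorithms 1.32 and 1.33 (0-based lists instead of 1-based sequences; lcp_compare is a plain implementation of the LcpCompare operation, included so the example is self-contained):

```python
def lcp_compare(A, B, k=0):
    """LcpCompare(A, B, k): skip the first k known-equal characters."""
    i = k
    while i < len(A) and i < len(B) and A[i] == B[i]:
        i += 1
    if i < len(A) and i < len(B):
        return ('<' if A[i] < B[i] else '>'), i
    if len(A) == len(B):
        return '=', i
    return ('<' if len(A) < len(B) else '>'), i

def string_merge(P, Q):
    """Merge two sorted (string, lcp-with-predecessor) sequences."""
    P, Q = list(P), list(Q)   # local copies; entries are updated in place
    R = []
    i = j = 0
    while i < len(P) and j < len(Q):
        (S, k), (T, l) = P[i], Q[j]
        if k > l:                         # S shares a longer prefix with
            R.append((S, k)); i += 1      # the last output string
        elif l > k:
            R.append((T, l)); j += 1
        else:                             # k == l: compare, skipping k chars
            x, h = lcp_compare(S, T, k)
            if x == '<':
                R.append((S, k)); i += 1
                Q[j] = (T, h)             # lcp of T with its new predecessor
            else:
                R.append((T, l)); j += 1
                P[i] = (S, h)
    R.extend(P[i:]); R.extend(Q[j:])
    return R

def string_mergesort(R):
    """Sort strings; return [(string, lcp with predecessor)], first lcp 0."""
    if len(R) == 1:
        return [(R[0], 0)]
    m = len(R) // 2
    return string_merge(string_mergesort(R[:m]), string_mergesort(R[m:]))
```

For example, sorting ["banana", "apple", "band", "bandana", "app"] yields the strings in order together with the LCP array values 0, 3, 0, 3, 4.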

Lemma 1.34: StringMerge performs the merging correctly.
Proof. We will show that the following invariant holds at the beginning of
each round in the loop on lines (2)–(12):

Let X be the last string appended to R (or ε if R = ∅). Then
ki = lcp(X, Si) and ℓj = lcp(X, Tj).

The invariant is clearly true in the beginning. We will show that the invariant
is maintained and the smaller string is chosen in each round of the loop.

• If ki > ℓj, then lcp(X, Si) > lcp(X, Tj) and thus

– Si < Tj by Lemma 1.31.
– lcp(Si, Tj) = lcp(X, Tj) because, by Lemma 1.30,
lcp(X, Tj) = min{lcp(X, Si), lcp(Si, Tj)}.

Hence, the algorithm chooses the smaller string and maintains the
invariant. The case ℓj > ki is symmetric.

• If ki = ℓj, then clearly lcp(Si, Tj) ≥ ki and the call to LcpCompare is safe,
and the smaller string is chosen. The update ℓj ← h or ki ← h maintains
the invariant. □
Theorem 1.35: String mergesort sorts a set R of n strings in
O(ΣLCP (R) + n log n) time.

Proof. If the calls to LcpCompare took constant time, the time complexity
would be O(n log n) by the same argument as for standard mergesort.

Whenever LcpCompare makes more than one symbol comparison, say t + 1
comparisons, one of the lcp values stored with the strings increases by t.
Since the sum of the final lcp values is exactly ΣLCP(R), the extra time
spent in LcpCompare is bounded by O(ΣLCP(R)). □


• Other comparison-based sorting algorithms, for example heapsort and
insertion sort, can be adapted for strings using the lcp-comparison
technique.
String Binary Search

An ordered array is a simple static data structure supporting queries in
O(log n) time using binary search.

Algorithm 1.36: Binary search
Input: Ordered set R = {k1, k2, . . . , kn}, query value x.
Output: The number of elements in R that are smaller than x.
(1) left ← 0; right ← n + 1 // output value is in the range [left..right)
(2) while right − left > 1 do
(3) mid ← ⌊(left + right)/2⌋
(4) if kmid < x then left ← mid
(5) else right ← mid
(6) return left
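In Python this can be sketched as follows (0-based list, 1-based positions with sentinels 0 and n + 1, as in the pseudocode; the function name is illustrative):

```python
import bisect

def rank(R, x):
    """Number of elements in the sorted list R that are smaller than x,
    following Algorithm 1.36."""
    left, right = 0, len(R) + 1
    while right - left > 1:
        mid = (left + right) // 2
        if R[mid - 1] < x:      # k_mid < x
            left = mid
        else:
            right = mid
    return left
```

The result coincides with Python's bisect.bisect_left, which computes the same quantity.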

With strings as elements, however, the query time is

• O(m log n) in the worst case for a query string of length m.

• O(log n logσ n) on average for a random set of strings.

We can use the lcp-comparison technique to improve binary search for
strings. The following is a key result.

Lemma 1.37: Let A, B, B′ and C be strings such that A ≤ B ≤ C and
A ≤ B′ ≤ C. Then lcp(B, B′) ≥ lcp(A, C).

Proof. Let Bmin = min{B, B′} and Bmax = max{B, B′}. By Lemma 1.30,

lcp(A, C) = min(lcp(A, Bmax), lcp(Bmax, C))
≤ lcp(A, Bmax) = min(lcp(A, Bmin), lcp(Bmin, Bmax))
≤ lcp(Bmin, Bmax) = lcp(B, B′) □


During the binary search of P in {S1, S2, . . . , Sn}, the basic situation is the
following:

• We want to compare P and Smid.

• We have already compared P against Sleft and Sright, and we know that
Sleft ≤ P, Smid ≤ Sright.

• By using lcp-comparisons, we know lcp(Sleft, P) and lcp(P, Sright).

By Lemmas 1.30 and 1.37,

lcp(P, Smid) ≥ lcp(Sleft, Sright) = min{lcp(Sleft, P), lcp(P, Sright)}

Thus we can skip the first min{lcp(Sleft, P), lcp(P, Sright)} characters when
comparing P and Smid.

Algorithm 1.38: String binary search (without precomputed lcps)
Input: Ordered string set R = {S1, S2, . . . , Sn}, query string P.
Output: The number of strings in R that are smaller than P.
(1) left ← 0; right ← n + 1
(2) llcp ← 0 // llcp = lcp(Sleft, P)
(3) rlcp ← 0 // rlcp = lcp(P, Sright)
(4) while right − left > 1 do
(5) mid ← ⌊(left + right)/2⌋
(6) mlcp ← min{llcp, rlcp}
(7) (x, mlcp) ← LcpCompare(P, Smid, mlcp)
(8) if x = "<" then right ← mid; rlcp ← mlcp
(9) else left ← mid; llcp ← mlcp
(10) return left

• The average case query time is now O(log n).

• The worst case query time is still O(m log n) (exercise).
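A Python sketch of Algorithm 1.38 (0-based list, 1-based positions; lcp_compare is a plain implementation of LcpCompare; as in the pseudocode, the '=' outcome falls to the else branch, so a query equal to a stored string is counted on the left side):

```python
def lcp_compare(P, S, k=0):
    i = k
    while i < len(P) and i < len(S) and P[i] == S[i]:
        i += 1
    if i < len(P) and i < len(S):
        return ('<' if P[i] < S[i] else '>'), i
    if len(P) == len(S):
        return '=', i
    return ('<' if len(P) < len(S) else '>'), i

def string_rank(R, P):
    """Number of strings in the sorted list R smaller than the query P,
    maintaining llcp = lcp(S_left, P) and rlcp = lcp(P, S_right)."""
    left, right = 0, len(R) + 1         # sentinels 0 and n+1
    llcp = rlcp = 0
    while right - left > 1:
        mid = (left + right) // 2
        # skip the characters already known to match
        x, mlcp = lcp_compare(P, R[mid - 1], min(llcp, rlcp))
        if x == '<':
            right, rlcp = mid, mlcp
        else:
            left, llcp = mid, mlcp
    return left
```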

We can improve the worst case complexity by choosing the midpoint closer
to the smaller lcp value:

• If llcp − rlcp > 1, choose the middle position closer to the right.

• This is achieved by choosing the midpoint as a weighted average of the
left position and the right position. The weights are d and ln(d + 1),
where d = llcp − rlcp.

• If rlcp − llcp > 1, choose the middle position closer to the left in a
symmetric way.

• The worst case time complexity of the resulting algorithm (shown on
the next slide) is O(m logm n). The proof is omitted here.

• The lower bound on string binary searching time has been shown to be

Θ( (m log log n) / (log log((m log log n)/(log n))) + m + log n ).

There is a complicated algorithm achieving this time complexity.
Algorithm 1.39: Skewed string binary search (without precomputed lcps)
Input: Ordered string set R = {S1, S2, . . . , Sn}, query string P.
Output: The number of strings in R that are smaller than P.
(1) left ← 0; right ← n + 1
(2) llcp ← 0 // llcp = lcp(Sleft, P)
(3) rlcp ← 0 // rlcp = lcp(P, Sright)
(4) while right − left > 1 do
(5) if llcp − rlcp > 1 then
(6) d ← llcp − rlcp
(7) mid ← ⌈((ln(d + 1)) · left + d · right)/(d + ln(d + 1))⌉
(8) else if rlcp − llcp > 1 then
(9) d ← rlcp − llcp
(10) mid ← ⌊(d · left + (ln(d + 1)) · right)/(d + ln(d + 1))⌋
(11) else
(12) mid ← ⌊(left + right)/2⌋
(13) mlcp ← min{llcp, rlcp}
(14) (x, mlcp) ← LcpCompare(P, Smid, mlcp)
(15) if x = "<" then right ← mid; rlcp ← mlcp
(16) else left ← mid; llcp ← mlcp
(17) return left

The lower bound above assumes that no other information besides the
ordering of the strings is given. We can further improve string binary
searching by using precomputed information about the lcps between the
strings in R.

Consider again the basic situation during string binary search:

• We want to compare P and Smid.

• We have already compared P against Sleft and Sright, and we know
lcp(Sleft, P) and lcp(P, Sright).

In the unskewed algorithm, the values left and right are fully determined by
mid independently of P. That is, P only determines whether the search ends
up at position mid at all, but if it does, left and right are always the same.
Thus, we can precompute and store the values

LLCP[mid] = lcp(Sleft, Smid)
RLCP[mid] = lcp(Smid, Sright)

Now we know all lcp values between P, Sleft, Smid, Sright except lcp(P, Smid).
The following lemma shows how to utilize this.
Lemma 1.40: Let A, B, B′ and C be strings such that A ≤ B ≤ C and
A ≤ B′ ≤ C.

(a) If lcp(A, B) > lcp(A, B′), then B < B′ and lcp(B, B′) = lcp(A, B′).

(b) If lcp(A, B) < lcp(A, B′), then B > B′ and lcp(B, B′) = lcp(A, B).

(c) If lcp(B, C) > lcp(B′, C), then B > B′ and lcp(B, B′) = lcp(B′, C).

(d) If lcp(B, C) < lcp(B′, C), then B < B′ and lcp(B, B′) = lcp(B, C).

(e) If lcp(A, B) = lcp(A, B′) and lcp(B, C) = lcp(B′, C), then
lcp(B, B′) ≥ max{lcp(A, B), lcp(B, C)}.

Proof. Cases (a)–(d) are symmetric; we show (a). B < B′ follows from
Lemma 1.31. Then by Lemma 1.30, lcp(A, B′) = min{lcp(A, B), lcp(B, B′)}.
Since lcp(A, B′) < lcp(A, B), we must have lcp(A, B′) = lcp(B, B′).

In case (e), we use Lemma 1.30:

lcp(B, B′) ≥ min{lcp(A, B), lcp(A, B′)} = lcp(A, B)
lcp(B, B′) ≥ min{lcp(B, C), lcp(B′, C)} = lcp(B, C)

Thus lcp(B, B′) ≥ max{lcp(A, B), lcp(B, C)}. □
Algorithm 1.41: String binary search (with precomputed lcps)
Input: Ordered string set R = {S1, S2, . . . , Sn}, arrays LLCP and RLCP,
query string P.
Output: The number of strings in R that are smaller than P.
(1) left ← 0; right ← n + 1
(2) llcp ← 0; rlcp ← 0
(3) while right − left > 1 do
(4) mid ← ⌊(left + right)/2⌋
(5) if LLCP[mid] > llcp then left ← mid
(6) else if LLCP[mid] < llcp then right ← mid; rlcp ← LLCP[mid]
(7) else if RLCP[mid] > rlcp then right ← mid
(8) else if RLCP[mid] < rlcp then left ← mid; llcp ← RLCP[mid]
(9) else
(10) mlcp ← max{llcp, rlcp}
(11) (x, mlcp) ← LcpCompare(P, Smid, mlcp)
(12) if x = "<" then right ← mid; rlcp ← mlcp
(13) else left ← mid; llcp ← mlcp
(14) return left
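A Python sketch: the LLCP and RLCP arrays are filled by replaying the same left/right recursion that the unskewed search performs, and the search consults them before falling back to LcpCompare (function names are illustrative):

```python
def lcp(A, B):
    i = 0
    while i < len(A) and i < len(B) and A[i] == B[i]:
        i += 1
    return i

def lcp_compare(P, S, k=0):
    i = k
    while i < len(P) and i < len(S) and P[i] == S[i]:
        i += 1
    if i < len(P) and i < len(S):
        return ('<' if P[i] < S[i] else '>'), i
    if len(P) == len(S):
        return '=', i
    return ('<' if len(P) < len(S) else '>'), i

def precompute_lcps(R):
    """LLCP[mid] = lcp(S_left, S_mid) and RLCP[mid] = lcp(S_mid, S_right)
    for every mid the unskewed search can visit; the sentinel positions
    0 and n+1 contribute lcp value 0."""
    n = len(R)
    LLCP, RLCP = [0] * (n + 2), [0] * (n + 2)
    def fill(left, right):
        if right - left <= 1:
            return
        mid = (left + right) // 2
        if left >= 1:
            LLCP[mid] = lcp(R[left - 1], R[mid - 1])
        if right <= n:
            RLCP[mid] = lcp(R[mid - 1], R[right - 1])
        fill(left, mid)
        fill(mid, right)
    fill(0, n + 1)
    return LLCP, RLCP

def string_rank_lcp(R, LLCP, RLCP, P):
    """Algorithm 1.41: LcpCompare is called only when the stored
    lcp values cannot decide the step."""
    left, right = 0, len(R) + 1
    llcp = rlcp = 0
    while right - left > 1:
        mid = (left + right) // 2
        if LLCP[mid] > llcp:
            left = mid
        elif LLCP[mid] < llcp:
            right, rlcp = mid, LLCP[mid]
        elif RLCP[mid] > rlcp:
            right = mid
        elif RLCP[mid] < rlcp:
            left, llcp = mid, RLCP[mid]
        else:
            x, mlcp = lcp_compare(P, R[mid - 1], max(llcp, rlcp))
            if x == '<':
                right, rlcp = mid, mlcp
            else:
                left, llcp = mid, mlcp
    return left
```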

Theorem 1.42: An ordered string set R = {S1 , S2 , . . . , Sn } can be
preprocessed in O(ΣLCP (R) + n) time and O(n) space so that a binary
search with a query string P can be executed in O(|P | + log n) time.

Proof. The values LLCP[mid] and RLCP[mid] can be computed in
O(lcp(Smid, R \ {Smid}) + 1) time. Thus the arrays LLCP and RLCP can be
computed in O(Σlcp(R) + n) = O(ΣLCP(R) + n) time and stored in O(n)
space.

The main while loop in Algorithm 1.41 is executed O(log n) times, and
everything except LcpCompare on line (11) needs constant time.

If a given LcpCompare call performs t + 1 symbol comparisons, mlcp
increases by t on line (11). Then on lines (12)–(13), either llcp or rlcp
increases by at least t, since mlcp was max{llcp, rlcp} before LcpCompare.
Since llcp and rlcp never decrease and never grow larger than |P|, the total
number of extra symbol comparisons in LcpCompare during the binary
search is O(|P|). □

Other comparison-based data structures such as binary search trees can be
augmented with lcp information in the same way (study groups).

Hashing and Fingerprints
Hashing is a powerful technique for dealing with strings based on mapping
each string to an integer using a hash function:
H : Σ∗ → [0..q) ⊂ N
The most common use of hashing is with hash tables. Hash tables come in
many flavors that can be used with strings as well as with any other type of
object with an appropriate hash function. A drawback of using a hash table
to store a set of strings is that hash tables do not support lcp and prefix
queries.

Hashing is also used in other situations, where one needs to check whether
two strings S and T are the same or not:

• If H(S) ≠ H(T), then we must have S ≠ T.

• If H(S) = H(T), then S = T and S ≠ T are both possible.
If S ≠ T, this is called a collision.

When used this way, the hash value is often called a fingerprint, and its
range [0..q) is typically large as it is not restricted by a hash table size.
Any good hash function must depend on all characters. Thus computing
H(S) needs Ω(|S|) time, which can defeat the advantages of hashing:

• A plain comparison of two strings is faster than computing the hashes.

• The main strength of hash tables is the support for constant time
insertions and queries, but for example inserting a string S into a hash
table needs Ω(|S|) time when the hash computation time is included.
Compare this to the O(|S|) time for a trie under a constant alphabet
and the O(|S| + log n) time for a ternary trie.

However, a hash table can still be competitive in practice. Furthermore,
there are situations where a full computation of the hash function can be
avoided:

• A hash value can be computed once, stored, and used many times.

• Some hash functions can be computed more efficiently for a related set
of strings. An example is the Karp–Rabin hash function.
Definition 1.43: The Karp–Rabin hash function for a string
S = s0 s1 . . . sm−1 over an integer alphabet is

H(S) = (s0 · r^(m−1) + s1 · r^(m−2) + · · · + sm−2 · r + sm−1) mod q

for some fixed positive integers q and r.

Lemma 1.44: For any two strings A and B,

H(AB) = (H(A) · r^|B| + H(B)) mod q
H(B) = (H(AB) − H(A) · r^|B|) mod q

Proof. Without the modulo operation, the result would be obvious. The
modulo does not interfere because of the rules of modular arithmetic:

(x + y) mod q = ((x mod q) + (y mod q)) mod q
(xy) mod q = ((x mod q)(y mod q)) mod q

Thus we can quickly compute H(AB) from H(A) and H(B), and H(B) from
H(AB) and H(A). We will see applications of this later.

If q and r are coprime, then r has a multiplicative inverse r^(−1) modulo q,
and we can also compute H(A) = ((H(AB) − H(B)) · (r^(−1))^|B|) mod q.
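A quick numeric check of Definition 1.43 and Lemma 1.44 in Python (the choices r = 256 and the prime q = 10^9 + 7 are illustrative, not from the lecture):

```python
def kr_hash(S, r=256, q=10**9 + 7):
    """Karp-Rabin: H(S) = (s0*r^(m-1) + ... + s_(m-1)) mod q,
    computed with Horner's rule."""
    h = 0
    for c in S:
        h = (h * r + ord(c)) % q
    return h

r, q = 256, 10**9 + 7
A, B = "hello, ", "world"

# H(AB) from H(A) and H(B):
assert kr_hash(A + B) == (kr_hash(A) * pow(r, len(B), q) + kr_hash(B)) % q
# H(B) from H(AB) and H(A):
assert kr_hash(B) == (kr_hash(A + B) - kr_hash(A) * pow(r, len(B), q)) % q
# q is prime and r < q, so r has an inverse mod q; recover H(A):
r_inv = pow(r, -1, q)
assert kr_hash(A) == ((kr_hash(A + B) - kr_hash(B)) * pow(r_inv, len(B), q)) % q
```

Python's three-argument pow computes modular powers, and pow(r, -1, q) the modular inverse, so each identity is checked in O(log |B|) multiplications.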
The parameters q and r have to be chosen with some care to ensure that
collisions are rare for any reasonable set of strings.

• The original choice is r = σ and q is a large prime.

• Another possibility is that q is a power of two and r is a small prime
(r = 37 has been suggested). This is faster in practice, because the
slow modulo operations can be replaced by bitwise shift operations. If
q = 2^w, where w is the machine word size, the modulo operations can
be omitted completely. (But a bad case for this is a Thue–Morse
sequence.)

• If q and r were both powers of two, then only the last ⌈(log q)/(log r)⌉
characters of the string would affect the hash value. More generally, q
and r should be coprime, i.e., have no common divisors other than 1.

• The hash function can be randomized by choosing q or r randomly. For
example, if q is a prime and r is chosen uniformly at random from [0..q),
the probability that two strings of length m collide is at most m/q.

• A random choice over a set of possibilities has the additional advantage
that we can change the choice if the first choice leads to too many
collisions.
Automata
Finite automata are a well known way of representing sets of strings. In this
case, the set is often called a (regular) language.
A trie is a special type of automaton.
• The root is the initial state, the leaves are accept states, ...
• A trie is generally not a minimal automaton.
• Trie techniques including path compaction can be applied to automata.

Automata are much more powerful than tries in representing languages:

• Infinite languages
• Nondeterministic automata
• Even an acyclic, deterministic automaton can represent a language of
exponential size.

Automata support set inclusion testing but not other trie operations:
• No insertions and deletions
• No satellite data, i.e., data associated to each string
Sets of Strings: Summary

Efficient algorithms and data structures for sets of strings:

• Storing and searching: trie and ternary trie and their compact versions,
string binary search, Karp–Rabin hashing.
• Sorting: string quicksort and mergesort, LSD and MSD radix sort.

Lower bounds:

• Many of the algorithms are optimal.

• General purpose algorithms are asymptotically slower.

The central role of longest common prefixes:

• LCP array LCPR and its sum ΣLCP(R).

• Lcp-comparison technique.
Selected Literature

• Trie:
Fredkin: Trie Memory. Communications of the ACM. 3(9),
1960, pp. 490–499.

• Compact trie:
Morrison: PATRICIA—Practical Algorithm To Retrieve
Information Coded in Alphanumeric. Journal of the ACM, 15(4),
1968, pp. 514–534.

• Ternary trie and string quicksort:
Bentley & Sedgewick: Fast algorithms for sorting and searching
strings. Proc. 8th Annual ACM–SIAM Symposium on Discrete
Algorithms (SODA), 1997, pp. 360–369.

• MSD radix sort in O(ΣLCP(R) + n + σ) time:
Paige & Tarjan: Three partition refinement algorithms. SIAM
Journal on Computing, 16(6), 1987, pp. 973–989.
• String mergesort:
Ng & Kakehi: Merging String Sequences by Longest Common
Prefixes. IPSJ Journal, 49(2), 2008, pp. 958–967.

• Complexity of string binary search without precomputed lcp information:
Andersson, Hagerup, Håstad & Petersson: Tight bounds for
searching a sorted array of strings. SIAM Journal on Computing,
30(5), 2000, pp. 1552–1578.

• LCP array and string binary search using lcp information:
Manber & Myers: Suffix Arrays: A New Method for On-Line
String Searches. SIAM Journal on Computing, 22(5), 1993,
pp. 935–948.

• Karp–Rabin hashing:
Karp & Rabin: Efficient randomized pattern-matching
algorithms. IBM Journal of Research and Development, 31(2),
1987, pp. 249–260.