Lecture 04: Binary Search Trees
• Let S_v be the string stored at a node v in a binary search tree. Let S<
and S> be the closest lexicographically smaller and larger strings stored
at ancestors of v.
• The comparison of a query string P and the string S_v is done the same
way as the comparison of P and S_mid in string binary search. The roles
of S_left and S_right are taken by S< and S>.
• If each node v stores the values lcp(S<, S_v) and lcp(S_v, S>), then a
search in a balanced search tree can be executed in O(|P| + log n) time
(see the sketch below). Other operations including insertions and
deletions take O(|P| + log n) time too.
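To make the comparison step concrete, here is a minimal Python sketch of what happens at a single node (the function names and interface are our own; a real implementation stores lcp_low and lcp_high in the node and maintains llcp and rlcp along the search path, starting from llcp = rlcp = 0 at the root):

```python
def compare_at_node(P, S_v, llcp, rlcp, lcp_low, lcp_high):
    """One comparison step at a node v storing the string S_v.

    The search maintains llcp = lcp(P, S<) and rlcp = lcp(P, S>);
    the node stores lcp_low = lcp(S<, S_v) and lcp_high = lcp(S_v, S>).
    Returns (c, llcp', rlcp'): c < 0 means P < S_v (go left, S> becomes S_v),
    c > 0 means P > S_v (go right, S< becomes S_v), c == 0 means P = S_v.
    """
    if llcp >= rlcp:
        if lcp_low > llcp:
            # S_v agrees with S< past the point where P leaves S<,
            # and P > S< there, so P > S_v without inspecting P at all.
            return 1, llcp, rlcp
        if lcp_low < llcp:
            # S_v leaves S< at position lcp_low, where P still agrees
            # with S<, so P < S_v and lcp(P, S_v) = lcp_low.
            return -1, llcp, lcp_low
        k = llcp                      # P and S_v agree on llcp characters
    else:
        if lcp_high > rlcp:
            return -1, llcp, rlcp     # symmetric: P < S_v
        if lcp_high < rlcp:
            return 1, lcp_high, rlcp  # symmetric: P > S_v
        k = rlcp                      # P and S_v agree on rlcp characters
    # Only here are characters of P inspected, starting at position k.
    while k < len(P) and k < len(S_v) and P[k] == S_v[k]:
        k += 1
    if k == len(P) and k == len(S_v):
        return 0, llcp, rlcp          # P = S_v
    if k == len(S_v) or (k < len(P) and P[k] > S_v[k]):
        return 1, k, rlcp             # P > S_v, new llcp = lcp(P, S_v) = k
    return -1, llcp, k                # P < S_v, new rlcp = lcp(P, S_v) = k
```

Characters of P are inspected only in the final loop, and every successful comparison permanently increases max(llcp, rlcp); together with at most one failed comparison per node this gives the O(|P| + log n) bound.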
Hashing and Fingerprints
Hashing is also used in other situations, where one needs to check whether
two strings S and T are the same or not:
• If H(S) ≠ H(T), then we must have S ≠ T.
• If H(S) = H(T), then S = T and S ≠ T are both possible.
If H(S) = H(T) but S ≠ T, this is called a collision.
When used this way, the hash value is often called a fingerprint, and its
range [0..q) is typically large as it is not restricted by a hash table size.
Any good hash function must depend on all characters. Thus computing
H(S) needs Ω(|S|) time, which can defeat the advantages of hashing:
• A plain comparison of two strings is faster than computing the hashes.
• The main strength of hash tables is the support for constant time
insertions and deletions, but inserting a string S into a hash table needs
Ω(|S|) time when the hash computation time is included. Compare this
to the O(|S|) time for a trie under a constant alphabet and the
O(|S| + log n) time for a ternary trie.
On the other hand:
• A hash value can be computed once, stored, and used many times.
• Some hash functions can be computed more efficiently for a related set
of strings. An example is the Karp–Rabin hash function.
Definition 1.37: The Karp–Rabin hash function for a string
S = s_0 s_1 ... s_{m-1} over an integer alphabet is
H(S) = (s_0 r^{m-1} + s_1 r^{m-2} + ... + s_{m-2} r + s_{m-1}) mod q
for some fixed positive integers q and r.
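A minimal Python sketch of Definition 1.37, evaluated with Horner's rule (characters are mapped to integers with ord; the function name is ours):

```python
def kr_hash(S, q, r):
    """Karp-Rabin hash H(S) = (s_0 r^{m-1} + ... + s_{m-1}) mod q."""
    h = 0
    for c in S:
        h = (h * r + ord(c)) % q   # Horner's rule: one step per character
    return h
```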
Lemma 1.38: For any two strings A and B,
H(AB) = (H(A) · r^{|B|} + H(B)) mod q
H(B) = (H(AB) − H(A) · r^{|B|}) mod q
Proof. Without the modulo operation, the result would be obvious. The
modulo does not interfere because of the rules of modular arithmetic:
(x + y) mod q = ((x mod q) + (y mod q)) mod q
(xy) mod q = ((x mod q)(y mod q)) mod q
Thus we can quickly compute H(AB) from H(A) and H(B), and H(B) from
H(AB) and H(A). We will see applications of this later.
If q and r are coprime, then r has a multiplicative inverse r^{-1} modulo q, and
we can also compute H(A) = ((H(AB) − H(B)) · (r^{-1})^{|B|}) mod q.
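The identities above, including the coprime case, can be checked directly; a small Python sketch (the parameter values are our own illustrative choices, with q prime so that gcd(q, r) = 1):

```python
def kr_hash(S, q, r):               # as in the sketch above
    h = 0
    for c in S:
        h = (h * r + ord(c)) % q
    return h

q, r = (1 << 61) - 1, 1000003       # illustrative: q a Mersenne prime

A, B = "karp", "rabin"
hA, hB, hAB = kr_hash(A, q, r), kr_hash(B, q, r), kr_hash(A + B, q, r)

rB = pow(r, len(B), q)                          # r^{|B|} mod q
assert hAB == (hA * rB + hB) % q                # H(AB) from H(A), H(B)
assert hB == (hAB - hA * rB) % q                # H(B) from H(AB), H(A)

r_inv = pow(r, -1, q)                           # r^{-1} mod q (Python 3.8+)
assert hA == ((hAB - hB) * pow(r_inv, len(B), q)) % q   # and H(A) too
```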
The parameters q and r have to be chosen with some care to ensure that
collisions are rare for any reasonable set of strings.
• If q and r were both powers of two, then only the last ⌈(log q)/(log r)⌉
characters of the string would affect the hash value (demonstrated below).
More generally, q and r should be coprime, i.e., have no common divisors
other than 1.
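A quick Python check of the powers-of-two pitfall (the parameter values are illustrative only; here ⌈(log q)/(log r)⌉ = 2):

```python
# With q = 2^16 and r = 2^8, every term s_i * r^k with k >= 2 is
# divisible by q, so only the last two characters affect H(S).
q, r = 1 << 16, 1 << 8

def kr_hash(S):
    h = 0
    for c in S:
        h = (h * r + ord(c)) % q   # Horner's rule, as in the sketch above
    return h

assert kr_hash("xxxxab") == kr_hash("yyab")   # collision: only "ab" matters
```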
Automata
Finite automata are a well-known way of representing sets of strings. In this
case, the set is often called a language.
Automata are much more powerful than tries in representing languages:
• Infinite languages
• Nondeterministic automata
• Even an acyclic, deterministic automaton can represent a language of
exponential size. For example, the language of all strings of length k over
the alphabet {a, b} has 2^k strings but is recognized by an acyclic,
deterministic automaton with k + 1 states.
Sets of Strings: Summary
Lower bounds:
• Many of the algorithms are optimal.
• General purpose algorithms are asymptotically slower.
2. Exact String Matching
Let T = T[0..n) be the text and P = P[0..m) the pattern. We say that P
occurs in T at position j if T[j..j + m) = P.
In this part, we will describe algorithms that solve the following problem.
Problem 2.1: Given text T[0..n) and pattern P[0..m), report the first
position in T where P occurs, or n if P does not occur in T.
The algorithms can be easily modified to solve the following problems too.
• Existence: Is P a factor of T?
• Counting: Count the number of occurrences of P in T .
• Listing: Report all occurrences of P in T .
The naive, brute force algorithm compares P against T[0..m), then against
T[1..1 + m), then against T[2..2 + m), etc., until an occurrence is found or
the end of the text is reached.
The worst case time complexity is O(mn). This happens, for example, when
P = a^{m-1}b = aa...ab and T = a^n = aa...aa.
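As a reference point, a minimal Python sketch of the brute force algorithm, following the conventions of Problem 2.1 (the function name is ours):

```python
def brute_force(T, P):
    """Compare P against T[j..j+m) for j = 0, 1, 2, ... in turn.

    Returns the first position where P occurs in T, or len(T) if none.
    Worst case O(mn) time, e.g. P = "a"*(m-1) + "b", T = "a"*n.
    """
    n, m = len(T), len(P)
    for j in range(n - m + 1):
        i = 0
        while i < m and T[j + i] == P[i]:
            i += 1
        if i == m:
            return j        # first occurrence found at position j
    return n                # P does not occur in T
```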
Knuth–Morris–Pratt
The brute force algorithm forgets everything when it moves to the next
text position.
Example 2.3:
[Figure: the alignments of P = ainainen against T = ainaisesti-ainainen
tried by brute force, Morris–Pratt, and Knuth–Morris–Pratt, with the number
of character comparisons made at each alignment. Brute force tries every
text position in turn; MP uses the failure function to skip hopeless
alignments; KMP additionally avoids re-comparing a pattern character that is
already known to mismatch.]
The MP and KMP algorithms never go backwards in the text. When they
encounter a mismatch, they find another pattern position to compare
against the same text position. If the mismatch occurs at pattern position i,
then fail[i] is the next pattern position to compare.
The only difference between MP and KMP is how they compute the failure
function fail.
Algorithm 2.4: Knuth–Morris–Pratt / Morris–Pratt
Input: text T = T[0..n), pattern P = P[0..m)
Output: position of the first occurrence of P in T
(1) compute fail[0..m]
(2) i ← 0; j ← 0
(3) while i < m and j < n do
(4)     if i = −1 or P[i] = T[j] then i ← i + 1; j ← j + 1
(5)     else i ← fail[i]
(6) if i = m then return j − m else return n
• fail[i] = −1 means that there are no more pattern positions to compare
against this text position and we should move to the next text
position.
• fail[m] is never needed here, but if we wanted to find all occurrences, it
would tell how to continue after a full match.
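Algorithm 2.4 transcribes almost directly into Python; a minimal sketch (the function name is ours; fail is the failure array, computed as described below):

```python
def mp_search(T, P, fail):
    """Algorithm 2.4: return the first occurrence of P in T, or len(T)."""
    n, m = len(T), len(P)
    i = j = 0
    while i < m and j < n:
        if i == -1 or P[i] == T[j]:
            i += 1
            j += 1
        else:
            i = fail[i]      # try the next pattern position
    return j - m if i == m else n
```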
We will describe the MP failure function here. The KMP failure function is
left for the exercises.
• When the algorithm finds a mismatch between P[i] and T[j], we know
that P[0..i) = T[j − i..j).
• Now we want to find a new i′ < i such that P[0..i′) = T[j − i′..j).
Specifically, we want the largest such i′.
• This means that P[0..i′) = T[j − i′..j) = P[i − i′..i). In other words,
P[0..i′) is the longest proper border of P[0..i).
Example 2.5: Let P = ainainen.

 i  P[0..i)    border  fail[i]
 0  ε          –       -1
 1  a          ε        0
 2  ai         ε        0
 3  ain        ε        0
 4  aina       a        1
 5  ainai      ai       2
 6  ainain     ain      3
 7  ainaine    ε        0
 8  ainainen   ε        0

[Figure: the corresponding MP automaton for P = ainainen with states
-1, 0, 1, . . . , 8.]
An efficient algorithm for computing the failure function is very similar to
the search algorithm itself!
• In the MP algorithm, when we find a match P[i] = T[j], we know that
P[0..i] = T[j − i..j]. More specifically, P[0..i] is the longest prefix of P
that matches a suffix of T[0..j].
• Suppose T = #P[1..m), where # is a symbol that does not occur in P.
Finding a match P[i] = T[j], we know that P[0..i] is the longest prefix
of P that is a proper suffix of P[0..j]. Thus fail[j + 1] = i + 1.
• When the algorithm reads fail[i] on line (5), fail[i] has already been
computed.
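A minimal Python sketch of this preprocessing; it mirrors the search loop, matching P against itself (the function name is ours):

```python
def mp_failure(P):
    """MP failure function: fail[i] = length of the longest proper
    border of P[0..i), for i = 0, ..., m."""
    m = len(P)
    fail = [0] * (m + 1)
    fail[0] = -1
    i, j = -1, 0              # i: current border length candidate
    while j < m:
        while i >= 0 and P[i] != P[j]:
            i = fail[i]       # fall back to a shorter border
        i += 1
        j += 1
        fail[j] = i           # longest proper border of P[0..j)
    return fail
```

For P = ainainen this returns [-1, 0, 0, 0, 1, 2, 3, 0, 0], matching Example 2.5.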
Theorem 2.7: Algorithms MP and KMP preprocess a pattern in time O(m)
and then search the text in time O(n).
Proof. We show that the text search requires O(n) time. Exactly the same
argument shows that pattern preprocessing needs O(m) time.
On each round of the while loop, either the then-branch on line (4) or the
else-branch on line (5) is executed:
then Here both i and j increase by one. Since j never decreases and the
loop requires j < n, this branch is taken at most n times.
else Here i decreases since fail[i] < i. Since i only increases in the
then-branch, this branch cannot be taken more often than the
then-branch.
Thus the loop is executed O(n) times.
Shift-And (Shift-Or)
When the MP algorithm is at position j in the text T, it computes the
longest prefix of the pattern P[0..m) that is a suffix of T[0..j]. The
Shift-And algorithm computes all prefixes of P that are suffixes of T[0..j].
• The information is stored in a bitvector D of length m, where D.i = 1 if
P[0..i] = T[j − i..j] and D.i = 0 otherwise. (D.0 is the least significant
bit.)
• When D.(m − 1) = 1, we have found an occurrence.
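A minimal Python sketch of Shift-And; Python integers stand in for machine-word bitvectors, and B[c] is the precomputed mask with bit i set iff P[i] = c (the names are ours):

```python
def shift_and(T, P):
    """Return the first position where P occurs in T, or len(T)."""
    m = len(P)
    B = {}                            # B[c]: bit i set iff P[i] == c
    for i, c in enumerate(P):
        B[c] = B.get(c, 0) | (1 << i)
    D = 0                             # D.i = 1 iff P[0..i] = T[j-i..j]
    for j, c in enumerate(T):
        D = ((D << 1) | 1) & B.get(c, 0)
        if D & (1 << (m - 1)):        # D.(m-1) = 1: occurrence found
            return j - m + 1
    return len(T)
```

For example, shift_and("apassi", "assi") returns 2, matching Example 2.9 below.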
Example 2.9: P = assi, T = apassi; the bitvectors D (columns) after each
text character:

 i        a  p  a  s  s  i
 0 (a):   1  0  1  0  0  0
 1 (s):   0  0  0  1  0  0
 2 (s):   0  0  0  0  1  0
 3 (i):   0  0  0  0  0  1

[Figure: the corresponding nondeterministic automaton with states
-1, 0, 1, 2, 3 and transitions labelled a, s, s, i.]
After processing T[j], D.i = 1 if and only if there is a path from the initial
state (state -1) to state i with the string T[0..j].
On an integer alphabet, when m ≤ w (where w is the machine word size):
• Preprocessing time is O(σ + m).
• Search time is O(n).
If m > w, we can store the bitvectors in ⌈m/w⌉ machine words and perform
each bitvector operation in O(⌈m/w⌉) time.
• Preprocessing time is O(σ⌈m/w⌉ + m).
• Search time is O(n⌈m/w⌉).
If no pattern prefix longer than w matches a current text suffix, then only
the least significant machine word contains 1’s. There is no need to update
the other words; they will stay 0.
• Then the search time is O(n) on average.