
String Binary Search Trees

Binary search can be seen as a search on an implicit binary search tree, where the middle element is the root, the middle elements of the first and second halves are the children of the root, etc. The string binary search technique can be extended to arbitrary binary search trees.

• Let Sv be the string stored at a node v in a binary search tree. Let S<
and S> be the closest lexicographically smaller and larger strings stored
at ancestors of v.

• The comparison of a query string P and the string Sv is done the same way as the comparison of P and Smid in string binary search. The roles of Sleft and Sright are taken by S< and S>. (A small sketch of this lcp-based comparison step follows the list below.)

• If each node v stores the values lcp(S<, Sv) and lcp(Sv, S>), then a search in a balanced search tree can be executed in O(|P| + log n) time. Other operations including insertions and deletions take O(|P| + log n) time too.
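
A minimal Python sketch of the lcp-based comparison step used at each node may make this concrete. The helper name lcp_compare and the convention of passing in the already-known common prefix length k are assumptions made for this illustration only:

def lcp_compare(P, S, k):
    # It is assumed that P and S are already known to agree on their
    # first k characters, so the scan can start at position k.
    # Returns (l, cmp): l = lcp(P, S); cmp is -1, 0 or +1 according to
    # the lexicographic order of P relative to S.
    l = k
    while l < len(P) and l < len(S) and P[l] == S[l]:
        l += 1
    if l == len(P) and l == len(S):
        return l, 0
    if l == len(P):            # P is a proper prefix of S, so P < S
        return l, -1
    if l == len(S):            # S is a proper prefix of P, so P > S
        return l, 1
    return l, -1 if P[l] < S[l] else 1

The returned lcp value can then be carried down the tree and compared against the stored lcp(S<, Sv) and lcp(Sv, S>) values, just as in string binary search.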

61
Hashing and Fingerprints

Hashing is a powerful technique for dealing with strings based on mapping each string to an integer using a hash function:
H : Σ∗ → [0..q) ⊂ N
The most common use of hashing is with hash tables. Hash tables come in many flavors that can be used with strings as well as with any other type of object that has an appropriate hash function. A drawback of using a hash table to store a set of strings is that it does not support lcp and prefix queries.

Hashing is also used in other situations, where one needs to check whether two strings S and T are the same or not:
• If H(S) ≠ H(T), then we must have S ≠ T.
• If H(S) = H(T), then S = T and S ≠ T are both possible.
If S ≠ T, this is called a collision.

When used this way, the hash value is often called a fingerprint, and its
range [0..q) is typically large as it is not restricted by a hash table size.
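
As a small illustration, here is the fingerprint test as a Python sketch; Python's built-in hash stands in for an arbitrary fingerprint function H and is only a placeholder:

def differ_for_sure(S, T, H=hash):
    # True guarantees S != T.
    # False leaves both S == T and S != T possible;
    # if S != T, the two strings collide under H.
    return H(S) != H(T)

In practice the fingerprints are computed once and stored with the strings, so repeated equality tests avoid rescanning the strings; a fingerprint match is then either verified by a direct comparison or accepted with a small error probability.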

62
Any good hash function must depend on all characters. Thus computing
H(S) needs Ω(|S|) time, which can defeat the advantages of hashing:
• A plain comparison of two strings is faster than computing the hashes.
• The main strength of hash tables is the support for constant time
insertions and deletions, but inserting a string S into a hash table needs
Ω(|S|) time when the hash computation time is included. Compare this
to the O(|S|) time for a trie under a constant alphabet and the
O(|S| + log n) time for a ternary trie.

However, a hash table can still be competitive in practice. Furthermore, there are situations, where a full computation of the hash function can be avoided:

• A hash value can be computed once, stored, and used many times.

• Some hash functions can be computed more efficiently for a related set
of strings. An example is the Karp–Rabin hash function.

63
Definition 1.37: The Karp–Rabin hash function for a string S = s_0 s_1 ... s_{m−1} over an integer alphabet is
H(S) = (s_0 r^{m−1} + s_1 r^{m−2} + ... + s_{m−2} r + s_{m−1}) mod q
for some fixed positive integers q and r.
Lemma 1.38: For any two strings A and B,
H(AB) = (H(A) · r^{|B|} + H(B)) mod q
H(B) = (H(AB) − H(A) · r^{|B|}) mod q

Proof. Without the modulo operation, the result would be obvious. The
modulo does not interfere because of the rules of modular arithmetic:
(x + y) mod q = ((x mod q) + (y mod q)) mod q
(xy) mod q = ((x mod q)(y mod q)) mod q

Thus we can quickly compute H(AB) from H(A) and H(B), and H(B) from
H(AB) and H(A). We will see applications of this later.
If q and r are coprime, then r has a multiplicative inverse r^{−1} modulo q, and we can also compute H(A) = ((H(AB) − H(B)) · (r^{−1})^{|B|}) mod q.
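
Below is a minimal Python sketch of Definition 1.37 together with the identities of Lemma 1.38. The concrete values of q and r are arbitrary illustrative choices, not a recommendation:

q = (1 << 61) - 1   # a large prime (illustrative choice)
r = 256             # the radix, playing the role of r in Definition 1.37

def H(S):
    # H(S) = (s_0 r^{m-1} + ... + s_{m-2} r + s_{m-1}) mod q, by Horner's rule.
    h = 0
    for c in S:
        h = (h * r + ord(c)) % q
    return h

def H_concat(hA, hB, lenB):
    # Lemma 1.38: H(AB) = (H(A) * r^{|B|} + H(B)) mod q.
    return (hA * pow(r, lenB, q) + hB) % q

def H_suffix(hAB, hA, lenB):
    # Lemma 1.38: H(B) = (H(AB) - H(A) * r^{|B|}) mod q.
    return (hAB - hA * pow(r, lenB, q)) % q

# A quick check of the lemma on an example:
A, B = "karjala", "inen"
assert H(A + B) == H_concat(H(A), H(B), len(B))
assert H(B) == H_suffix(H(A + B), H(A), len(B))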
64
The parameters q and r have to be chosen with some care to ensure that
collisions are rare for any reasonable set of strings.

• The original choice is r = σ and q is a large prime.

• Another possibility is that q is a power of two and r is a small prime (r = 37 has been suggested). This is faster in practice, because the slow modulo operations can be replaced by fast bitwise operations. If q = 2^w, where w is the machine word size, the modulo operations can be omitted completely.

• If q and r were both powers of two, then only the last ⌈(log q)/(log r)⌉ characters of the string would affect the hash value. More generally, q and r should be coprime, i.e., have no common divisors other than 1.

• The hash function can be randomized by choosing q or r randomly. For example, if q is a prime and r is chosen uniformly at random from [0..q), the probability that two strings of length m collide is at most m/q.

• A random choice over a set of possibilities has the additional advantage that we can change the choice if the first choice leads to too many collisions.
65
Automata

Finite automata are a well-known way of representing sets of strings. In this case, the set is often called a language.

A trie is a special type of automaton.

• A trie is generally not a minimal automaton.
• Trie techniques including path compaction and ternary branching can be applied to automata.

Example 1.39: Compacted minimal automaton for R = {pot$, potato$, pottery$, tattoo$, tempo$}.

[Figure: the compacted minimal automaton; its edge labels include pot, t, at, o, tery, atto, emp and $, with shared suffixes merged into common states.]
66
Automata are much more powerful than tries in representing languages:
• Infinite languages
• Nondeterministic automata
• Even an acyclic, deterministic automaton can represent a language of
exponential size.

Automata do not support all operations of tries:


• Insertions and deletions
• Satellite data, i.e., data associated with each string.

67
Sets of Strings: Summary

Efficient algorithms and data structures for sets of strings:


• Storing and searching: trie and ternary trie and their compact versions,
string binary search and string binary search tree, Karp–Rabin hashing.
• Sorting: string quicksort and mergesort, LSD and MSD radix sort.

Lower bounds:
• Many of the algorithms are optimal.
• General purpose algorithms are asymptotically slower.

The central role of longest common prefixes:


• LCP array LCP_R and its sum ΣLCP(R).
• Lcp-comparison technique.

68
2. Exact String Matching
Let T = T[0..n) be the text and P = P[0..m) the pattern. We say that P occurs in T at position j if T[j..j+m) = P.

Example: P = aine occurs at position 6 in T = karjalainen.

In this part, we will describe algorithms that solve the following problem.

Problem 2.1: Given text T[0..n) and pattern P[0..m), report the first position in T where P occurs, or n if P does not occur in T.

The algorithms can be easily modified to solve the following problems too.
• Existence: Is P a factor of T ?
• Counting: Count the number of occurrences of P in T .
• Listing: Report all occurrences of P in T .

69
The naive, brute force algorithm compares P against T[0..m), then against T[1..1+m), then against T[2..2+m), etc., until an occurrence is found or the end of the text is reached.

Algorithm 2.2: Brute force
Input: text T = T[0..n), pattern P = P[0..m)
Output: position of the first occurrence of P in T
(1) i ← 0; j ← 0
(2) while i < m and j < n do
(3)   if P[i] = T[j] then i ← i + 1; j ← j + 1
(4)   else j ← j − i + 1; i ← 0
(5) if i = m then return j − m else return n

The worst case time complexity is O(mn). This happens, for example, when P = a^{m−1}b = aaa..ab and T = a^n = aaaaaa..aa.
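
A direct Python transcription of Algorithm 2.2 (a sketch; the return convention follows Problem 2.1):

def brute_force(T, P):
    # Return the position of the first occurrence of P in T, or len(T).
    n, m = len(T), len(P)
    i, j = 0, 0
    while i < m and j < n:
        if P[i] == T[j]:
            i += 1; j += 1
        else:
            j = j - i + 1; i = 0
    return j - m if i == m else n

# brute_force("karjalainen", "aine") returns 6, as in the example above.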

70
Knuth–Morris–Pratt

The Brute force algorithm forgets everything when it moves to the next
text position.

The Morris–Pratt (MP) algorithm remembers matches. It never goes back to a text character that already matched.

The Knuth–Morris–Pratt (KMP) algorithm remembers mismatches too.

Example 2.3: Searching for P = ainainen in T = ainaisesti-ainainen. At the first alignment all three algorithms make 6 comparisons: ainai matches and then n ≠ s.

[Figure: the slide shows the full comparison traces of brute force, Morris–Pratt and Knuth–Morris–Pratt side by side, with the number of character comparisons at each pattern alignment in parentheses. Brute force re-reads text characters that have already matched, MP never does, and KMP additionally skips alignments that the remembered mismatch already rules out.]

71
The MP and KMP algorithms never go backwards in the text. When they encounter a mismatch, they find another pattern position to compare against the same text position. If the mismatch occurs at pattern position i, then fail[i] is the next pattern position to compare.
The only difference between MP and KMP is how they compute the failure function fail.
Algorithm 2.4: Knuth–Morris–Pratt / Morris–Pratt
Input: text T = T[0..n), pattern P = P[0..m)
Output: position of the first occurrence of P in T
(1) compute fail[0..m]
(2) i ← 0; j ← 0
(3) while i < m and j < n do
(4)   if i = −1 or P[i] = T[j] then i ← i + 1; j ← j + 1
(5)   else i ← fail[i]
(6) if i = m then return j − m else return n
• fail[i] = −1 means that there are no more pattern positions to compare against this text position and we should move on to the next text position.
• fail[m] is never needed here, but if we wanted to find all occurrences, it would tell how to continue after a full match.
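
The search loop of Algorithm 2.4 as a Python sketch; fail is assumed to be the array produced by Algorithm 2.6 below (the MP failure function), with fail[0] = −1:

def kmp_search(T, P, fail):
    # Return the position of the first occurrence of P in T, or len(T).
    # fail[i] = -1 means: advance to the next text position.
    n, m = len(T), len(P)
    i, j = 0, 0
    while i < m and j < n:
        if i == -1 or P[i] == T[j]:
            i += 1; j += 1
        else:
            i = fail[i]
    return j - m if i == m else n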
72
We will describe the MP failure function here. The KMP failure function is left for the exercises.
• When the algorithm finds a mismatch between P[i] and T[j], we know that P[0..i) = T[j−i..j).
• Now we want to find a new i′ < i such that P[0..i′) = T[j−i′..j). Specifically, we want the largest such i′.
• This means that P[0..i′) = T[j−i′..j) = P[i−i′..i). In other words, P[0..i′) is the longest proper border of P[0..i): a proper prefix that is also a proper suffix.

Example: ai is the longest proper border of ainai.

• Thus fail[i] is the length of the longest proper border of P[0..i).
• P[0..0) = ε has no proper border. We set fail[0] = −1.

73
Example 2.5: Let P = ainainen.

 i   P[0..i)     border   fail[i]
 0   ε           –        -1
 1   a           ε         0
 2   ai          ε         0
 3   ain         ε         0
 4   aina        a         1
 5   ainai       ai        2
 6   ainain      ain       3
 7   ainaine     ε         0
 8   ainainen    ε         0

The (K)MP algorithm operates like an automaton, since it never moves backwards in the text. Indeed, it can be described by an automaton that has a special failure transition, which is an ε-transition that can be taken only when there is no other transition to take.

[Figure: the automaton for P = ainainen with states -1, 0, 1, ..., 8; state -1 has a transition on any character (Σ) to state 0, each state i has a transition on the next pattern character (a, i, n, a, i, n, e, n) to state i+1, and failure transitions follow fail.]

74
An efficient algorithm for computing the failure function is very similar to the search algorithm itself!
• In the MP algorithm, when we find a match P[i] = T[j], we know that P[0..i] = T[j−i..j]. More specifically, P[0..i] is the longest prefix of P that matches a suffix of T[0..j].
• Suppose T = #P[1..m), where # is a symbol that does not occur in P. Finding a match P[i] = T[j], we know that P[0..i] is the longest prefix of P that is a proper suffix of P[0..j]. Thus fail[j+1] = i+1.

Algorithm 2.6: Morris–Pratt failure function computation
Input: pattern P = P[0..m)
Output: array fail[0..m] for P
(1) i ← −1; j ← 0; fail[j] ← i
(2) while j < m do
(3)   if i = −1 or P[i] = P[j] then i ← i + 1; j ← j + 1; fail[j] ← i
(4)   else i ← fail[i]
(5) return fail

• When the algorithm reads fail[i] on line 4, fail[i] has already been computed.
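
The same computation as a Python sketch (MP failure function only; the KMP variant is left to the exercises, as noted above):

def mp_fail(P):
    # fail[i] = length of the longest proper border of P[0..i), fail[0] = -1.
    m = len(P)
    fail = [0] * (m + 1)
    i, j = -1, 0
    fail[j] = i
    while j < m:
        if i == -1 or P[i] == P[j]:
            i += 1; j += 1
            fail[j] = i
        else:
            i = fail[i]
    return fail

# mp_fail("ainainen") returns [-1, 0, 0, 0, 1, 2, 3, 0, 0], matching Example 2.5.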

75
Theorem 2.7: Algorithms MP and KMP preprocess a pattern in time O(m)
and then search the text in time O(n).

Proof. We show that the text search requires O(n) time. Exactly the same
argument shows that pattern preprocessing needs O(m) time.

It is sufficient to count the number of comparisons that the algorithms make. After each comparison P[i] = T[j], one of the two conditional branches is executed:

then: Here j is incremented. Since j never decreases, this branch can be taken at most n + 1 times.

else: Here i decreases, since fail[i] < i. Since i only increases in the then-branch, this branch cannot be taken more often than the then-branch.


76
Shift-And (Shift-Or)
When the MP algorithm is at position j in the text T, it computes the longest prefix of the pattern P[0..m) that is a suffix of T[0..j]. The Shift-And algorithm computes all prefixes of P that are suffixes of T[0..j].
• The information is stored in a bitvector D of length m, where D.i = 1 if P[0..i] = T[j−i..j] and D.i = 0 otherwise. (D.0 is the least significant bit.)
• When D.(m − 1) = 1, we have found an occurrence.

The bitvector D is updated at each text position j:
• There are precomputed bitvectors B[c], for all c ∈ Σ, where B[c].i = 1 if P[i] = c and B[c].i = 0 otherwise.
• D is updated in two steps:
1. D ← (D << 1) + 1 (the bitwise shift). Now D tells which prefixes would match if T[j] matched every character.
2. D ← D & B[T[j]] (the bitwise and). Remove the prefixes where T[j] does not match.
77
Let w be the wordsize of the computer, typically 64. Assume first that
m ≤ w. Then each bitvector can be stored in a single integer.

Algorithm 2.8: Shift-And
Input: text T = T[0..n), pattern P = P[0..m)
Output: position of the first occurrence of P in T
Preprocess:
(1) for c ∈ Σ do B[c] ← 0
(2) for i ← 0 to m − 1 do B[P[i]] ← B[P[i]] + 2^i   // B[P[i]].i ← 1
Search:
(3) D ← 0
(4) for j ← 0 to n − 1 do
(5)   D ← ((D << 1) + 1) & B[T[j]]
(6)   if D & 2^{m−1} ≠ 0 then return j − m + 1   // D.(m − 1) = 1
(7) return n

Shift-Or is a minor optimization of Shift-And. It is the same algorithm except the roles of 0's and 1's in the bitvectors have been swapped. Then & on line 5 is replaced by | (bitwise or). The advantage is that we don't need that "+1" on line 5.
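
A Python sketch of Algorithm 2.8 (the Shift-And variant). Because Python integers have arbitrary precision, the same code also runs for m > w, each bitvector operation then costing O(⌈m/w⌉) word operations as discussed below; the alphabet is taken to be just the characters that actually occur in P:

def shift_and(T, P):
    # Return the position of the first occurrence of P in T, or len(T).
    n, m = len(T), len(P)
    if m == 0:
        return 0
    B = {}                          # B[c] has bit i set iff P[i] == c
    for i, c in enumerate(P):
        B[c] = B.get(c, 0) | (1 << i)
    D = 0                           # bit i of D: P[0..i] == T[j-i..j]
    for j in range(n):
        D = ((D << 1) + 1) & B.get(T[j], 0)
        if D & (1 << (m - 1)):      # D.(m-1) = 1: occurrence found
            return j - m + 1
    return n

# shift_and("apassi", "assi") returns 2 (cf. Example 2.9 below).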

78
Example 2.9: P = assi, T = apassi, bitvectors are columns.

B[c], c ∈ {a,i,p,s}:            D at each step:
      a  i  p  s                      a  p  a  s  s  i
  a   1  0  0  0                a  0  1  0  1  0  0  0
  s   0  0  0  1                s  0  0  0  0  1  0  0
  s   0  0  0  1                s  0  0  0  0  0  1  0
  i   0  1  0  0                i  0  0  0  0  0  0  1

The Shift-And algorithm can also be seen as a bitparallel simulation of the nondeterministic automaton that accepts a string ending with P.

[Figure: the NFA for P = assi: states -1, 0, 1, 2, 3, a Σ self-loop at the initial state -1, and transitions a, s, s, i from each state to the next; state 3 is the accepting state.]

After processing T[j], D.i = 1 if and only if there is a path from the initial state (state -1) to state i with the string T[0..j].

79
On an integer alphabet when m ≤ w:
• Preprocessing time is O(σ + m).
• Search time is O(n).

If m > w, we can store the bitvectors in ⌈m/w⌉ machine words and perform each bitvector operation in O(⌈m/w⌉) time.
• Preprocessing time is O(σ⌈m/w⌉ + m).
• Search time is O(n⌈m/w⌉).

If no pattern prefix longer than w matches a current text suffix, then only
the least significant machine word contains 1’s. There is no need to update
the other words; they will stay 0.
• Then the search time is O(n) on average.

Algorithms like Shift-And that take advantage of the implicit parallelism in bitvector operations are called bitparallel.

80
