Module 06. String Algorithms Lecture 3-6
(Contd.)
Knuth Morris Pratt Algorithm
• The KMP matching algorithm exploits the degenerating property of the
pattern (the same sub-patterns appearing more than once in the pattern)
and improves the worst-case complexity to O(n), where n is the length
of the text (O(n + m) when preprocessing a pattern of length m is counted).
• The basic idea behind KMP is: whenever we detect a mismatch (after
some matches), we already know some of the characters in the text of
the next window. We use this information to avoid re-matching
characters that we know will match anyway.
Matching Overview
txt = "AAAAABAAABA"
pat = "AAAA"
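The idea above can be sketched in C as follows. This is a minimal illustration, not the lecture's own code: the names `build_lps` and `kmp_search` and the `out`/`max_out` interface are assumptions made for this example. The `lps` table records, for each prefix of the pattern, the length of its longest proper prefix that is also a suffix, which is what lets the search skip characters already known to match.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* lps[i] = length of the longest proper prefix of pat[0..i]
   that is also a suffix of pat[0..i]. */
static void build_lps(const char *pat, size_t m, size_t *lps)
{
    size_t len = 0;                 /* length of the prefix matched so far */
    lps[0] = 0;
    for (size_t i = 1; i < m; ) {
        if (pat[i] == pat[len]) {
            lps[i++] = ++len;
        } else if (len > 0) {
            len = lps[len - 1];     /* fall back without advancing i */
        } else {
            lps[i++] = 0;
        }
    }
}

/* Stores up to max_out match positions in out and returns the total
   number of matches (returns 0 if the lps table cannot be allocated). */
size_t kmp_search(const char *txt, const char *pat, size_t *out, size_t max_out)
{
    size_t n = strlen(txt), m = strlen(pat), count = 0;
    assert(m > 0);
    size_t *lps = malloc(m * sizeof *lps);
    if (lps == NULL) return 0;
    build_lps(pat, m, lps);

    size_t i = 0, j = 0;            /* i indexes txt, j indexes pat */
    while (i < n) {
        if (txt[i] == pat[j]) {
            i++; j++;
            if (j == m) {           /* full match ending at i - 1 */
                if (count < max_out) out[count] = i - m;
                count++;
                j = lps[j - 1];     /* reuse the overlap for the next window */
            }
        } else if (j > 0) {
            j = lps[j - 1];         /* mismatch after some matches: never move i back */
        } else {
            i++;
        }
    }
    free(lps);
    return count;
}
```

For txt = "AAAAABAAABA" and pat = "AAAA", this sketch reports two matches, at positions 0 and 1. The text index `i` never decreases, which is where the linear bound comes from.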
Rabin-Karp Example
• Hash value of “AAAAA” is 37
• Hash value of “AAAAH” is 100
Rabin-Karp Algorithm
pattern is M characters long
hash_p = hash value of pattern
hash_t = hash value of first M letters in body of text
do
    if (hash_p == hash_t)
        brute force comparison of pattern and selected section of text
    hash_t = hash value of next section of text, one character over
while (not at end of text)
Hash Function
• Let b be the number of letters in the alphabet. The text substring t[i .. i+M-1] is
mapped to the number
$x(i) = t[i]\,b^{M-1} + t[i+1]\,b^{M-2} + \dots + t[i+M-1]$
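As a small worked illustration, write x(i) for the number assigned to t[i .. i+M-1], with the leftmost letter as the most significant digit. The symbol x, the decimal alphabet and the digit string are assumptions chosen here for readability. With b = 10, M = 3 and the text 3 1 4 1 5, the first two windows map to:

```latex
\begin{align*}
x(0) &= 3\cdot 10^{2} + 1\cdot 10^{1} + 4 = 314,\\
x(1) &= 1\cdot 10^{2} + 4\cdot 10^{1} + 1 = 141.
\end{align*}
```

Each window is simply read as an M-digit number in base b.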
Rabin-Karp Mods
• If M is large, then the resulting value (~b^M) will be enormous. For this reason, we
hash the value by taking it mod a prime number q.
• The mod function is particularly useful in this case due to several of its inherent
properties:
$[(x \bmod q) + (y \bmod q)] \bmod q = (x + y) \bmod q$
$(x \bmod q) \bmod q = x \bmod q$
• For these reasons:
$h(i) = \big((t[i]\,b^{M-1} \bmod q) + (t[i+1]\,b^{M-2} \bmod q) + \dots + (t[i+M-1] \bmod q)\big) \bmod q$
$h(i+1) = \big(h(i)\,b \bmod q \;-\; t[i]\,b^{M} \bmod q \;+\; t[i+M] \bmod q\big) \bmod q$
where the term h(i) b shifts left one digit, the term t[i] b^M subtracts the
leftmost digit, and the term t[i+M] adds the new rightmost digit.
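The rolling update above can be sketched in C as follows. This is an illustrative sketch under stated assumptions, not the lecture's implementation: the function name `rabin_karp`, the base b = 256 (one byte per letter) and the prime q = 1000003 are choices made for this example. The quantity q is added before subtracting so that the intermediate value never goes negative.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define RK_BASE  256ULL      /* b: assumed alphabet size (one byte per letter) */
#define RK_PRIME 1000003ULL  /* q: an assumed prime modulus */

/* Returns the index of the first occurrence of pat in txt, or -1 if none. */
long rabin_karp(const char *txt, const char *pat)
{
    size_t n = strlen(txt), m = strlen(pat);
    assert(m > 0);
    if (m > n) return -1;

    unsigned long long hash_p = 0, hash_t = 0, b_pow_m = 1;
    for (size_t k = 0; k < m; k++) {
        hash_p = (hash_p * RK_BASE + (unsigned char)pat[k]) % RK_PRIME;
        hash_t = (hash_t * RK_BASE + (unsigned char)txt[k]) % RK_PRIME;
        b_pow_m = (b_pow_m * RK_BASE) % RK_PRIME;   /* ends as b^M mod q */
    }

    for (size_t i = 0; ; i++) {
        /* Equal hashes may be a collision, so confirm by brute force. */
        if (hash_p == hash_t && memcmp(txt + i, pat, m) == 0)
            return (long)i;
        if (i + m >= n) break;                      /* no further window */

        /* h(i+1) = (h(i) b - t[i] b^M + t[i+M]) mod q */
        unsigned long long lead = ((unsigned char)txt[i] * b_pow_m) % RK_PRIME;
        hash_t = (hash_t * RK_BASE + RK_PRIME - lead
                  + (unsigned char)txt[i + m]) % RK_PRIME;
    }
    return -1;
}
```

For example, `rabin_karp("AAAAABAAABA", "AAAB")` returns 2. Each window costs constant time to hash, and the brute-force comparison runs only when the two hash values agree.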
Rabin-Karp Complexity
• If a sufficiently large prime number is used for the hash function,
the hashed values of two different patterns will usually be distinct.
• If this is the case, searching takes O(N) expected time, where N is
the number of characters in the larger body of text.
• It is always possible to construct a scenario with a worst-case
complexity of O(MN): nearly every window has the same hash value as
the pattern and must be verified by brute force. Through spurious
collisions this is likely only if the prime number used for hashing is
small; it also occurs when the pattern genuinely matches at almost
every position.
Data structures (Tries and compressed tries) for strings
Tries
• A trie is an efficient information reTrieval data structure.
• Tries can reduce search complexity to the optimal limit (key length).
• A balanced binary search tree needs O(M * log N) retrieval time, where
M is the maximum string length and N is the number of keys in the tree.
• Using a trie, we can search for a key in O(M) time, but the space
required by a trie can be its limitation.
• All descendants of a node in a trie share the prefix associated with
that node, hence tries are also known as prefix trees.
Trie Node
#include <stdbool.h>

// Assumed alphabet: lowercase letters 'a'..'z'
#define ALPHABET_SIZE 26

// Trie node
struct TrieNode
{
    struct TrieNode *children[ALPHABET_SIZE];
    // isEndOfWord is true if the node
    // represents end of a word
    bool isEndOfWord;
};
void insert(String s) {
    for (every char in string s) {
        if (child node belonging to current char is null) {
            child_node = new Node();
        }
        current_node = child_node;
    }
    Mark isEndOfWord
}
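The insert pseudocode can be made concrete in C roughly as follows. This is a sketch under assumptions not stated in the slides: keys contain only lowercase 'a'..'z' (so `ALPHABET_SIZE` is 26), and the helper names `trie_new_node` and `trie_insert` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define ALPHABET_SIZE 26   /* assumption: keys use only 'a'..'z' */

struct TrieNode {
    struct TrieNode *children[ALPHABET_SIZE];
    bool isEndOfWord;
};

/* Allocates a node with no children; returns NULL on failure.
   Relies on calloc's zero bytes reading back as NULL pointers,
   which holds on mainstream platforms. */
struct TrieNode *trie_new_node(void)
{
    return calloc(1, sizeof(struct TrieNode));
}

/* Inserts key; returns false on an unsupported character or on
   allocation failure (nodes created before the failure are kept). */
bool trie_insert(struct TrieNode *root, const char *key)
{
    assert(root != NULL && key != NULL);
    struct TrieNode *cur = root;
    for (; *key != '\0'; key++) {
        int idx = *key - 'a';
        if (idx < 0 || idx >= ALPHABET_SIZE) return false;
        if (cur->children[idx] == NULL) {
            cur->children[idx] = trie_new_node();
            if (cur->children[idx] == NULL) return false;
        }
        cur = cur->children[idx];   /* move down one level */
    }
    cur->isEndOfWord = true;        /* mark the last node as a word end */
    return true;
}
```

Inserting a key of length M touches at most M + 1 nodes, independent of how many keys the trie already holds.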
Standard trie for {bear, bell, bid, bull, buy, sell, stock, stop}
Compressed trie: obtained from the standard trie by merging each chain
of single-child nodes into one edge.
Searching in Tries
• Searching for a key is similar to the insert operation: compare the
characters and move down.
• The search can terminate due to the end of the key string or the lack
of the key in the trie.
• In the former case, if the isEndOfWord field of the last node is true,
then the key exists in the trie.
• In the latter case, the search terminates without examining all the
characters of the key, since the key is not present in the trie.
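The two ways a search can end map directly onto a short C function. The sketch below assumes the same node layout as the lecture's `struct TrieNode` with a lowercase-only alphabet; the function name `trie_search` is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define ALPHABET_SIZE 26   /* assumption: keys use only 'a'..'z' */

struct TrieNode {
    struct TrieNode *children[ALPHABET_SIZE];
    bool isEndOfWord;
};

/* Returns true only if key was stored as a complete word. */
bool trie_search(const struct TrieNode *root, const char *key)
{
    assert(key != NULL);
    const struct TrieNode *cur = root;
    /* Stop early as soon as a required child is missing. */
    for (; cur != NULL && *key != '\0'; key++) {
        int idx = *key - 'a';
        if (idx < 0 || idx >= ALPHABET_SIZE) return false;
        cur = cur->children[idx];
    }
    /* cur == NULL: the key ran off the trie (second case).
       Otherwise the key was consumed; it is present only if the
       last node is marked as the end of a word (first case). */
    return cur != NULL && cur->isEndOfWord;
}
```

With a trie holding only "by", searching for "by" succeeds, "b" fails because its last node is not a word end, and "bye" fails because the search runs out of nodes.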
Lecture 36: Suffix tree and suffix array data structures and related
operations for string handling
Suffix Tree
• A Suffix Tree for a given text is a compressed trie for all suffixes of
the given text.
Example of a compressed trie:
• Given words: {bear, bell, bid, bull, buy, sell, stock, stop}
Build a Suffix Tree for a given text
• 1) Generate all suffixes of the given text.
• 2) Consider all suffixes as individual words and build a compressed
trie.
• Example text "banana\0":
• banana\0
• anana\0
• nana\0
• ana\0
• na\0
• a\0
• \0
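Step 1 is cheap in C, because every suffix of a NUL-terminated string is itself a NUL-terminated string starting at a later address. The following sketch (the function name `list_suffixes` and its `out`/`max_out` interface are assumptions for illustration) collects pointers to all suffixes, including the final one that consists only of the terminator.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stores pointers to the suffixes text+0, text+1, ..., text+n in out
   (at most max_out of them) and returns the total number of suffixes,
   n + 1, counting the empty suffix that holds only the terminator. */
size_t list_suffixes(const char *text, const char **out, size_t max_out)
{
    assert(text != NULL && out != NULL);
    size_t n = strlen(text);
    for (size_t i = 0; i <= n && i < max_out; i++)
        out[i] = text + i;      /* suffix i shares storage with text */
    return n + 1;
}
```

For "banana" this yields 7 suffixes, matching the list above: entry 3 is "ana" and entry 6 is the empty suffix. No characters are copied, so listing the suffixes takes O(n) time even though their total length is O(n^2).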
Search in Suffix Tree
Example:
• Panamabananas$
Possible suffixes:
index suffix
0 Panamabananas$
1 anamabananas$
2 namabananas$
3 amabananas$
4 mabananas$
5 abananas$
6 bananas$
7 ananas$
8 nanas$
9 anas$
10 nas$
11 as$
12 s$
13 $
Pattern searching in suffix tree
• Step 1: Check whether the given pattern exists in the string by
traversing the suffix tree from the root along the characters of the
pattern.
• Step 2: If the pattern is found in the suffix tree, traverse the
subtree below that point and collect the suffix indices on all leaf
nodes. Those suffix indices are exactly the positions of the pattern
in the string.
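The two steps can be illustrated with a simplified structure. The sketch below builds an uncompressed suffix trie, not a true compressed suffix tree, so it uses O(n^2) nodes and is only suitable for short texts. The names `STNode`, `st_build`, `st_find_all` and `st_free` are hypothetical, and the text is assumed to be 7-bit ASCII.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define ST_ALPHABET 128    /* assumption: 7-bit ASCII text */

struct STNode {
    struct STNode *child[ST_ALPHABET];
    long suffix_index;     /* start of the suffix ending here, else -1 */
};

static struct STNode *st_new(void)
{
    struct STNode *node = calloc(1, sizeof *node);
    if (node == NULL) abort();          /* keep the sketch simple */
    node->suffix_index = -1;
    return node;
}

/* Inserts every suffix of text into an uncompressed trie. */
struct STNode *st_build(const char *text)
{
    struct STNode *root = st_new();
    size_t n = strlen(text);
    for (size_t i = 0; i < n; i++) {
        struct STNode *cur = root;
        for (size_t k = i; k < n; k++) {
            unsigned char c = (unsigned char)text[k];
            if (c >= ST_ALPHABET) abort();
            if (cur->child[c] == NULL) cur->child[c] = st_new();
            cur = cur->child[c];
        }
        cur->suffix_index = (long)i;
    }
    return root;
}

/* Step 2: depth-first walk collecting suffix indices below node. */
static size_t st_collect(const struct STNode *node, long *out,
                         size_t max_out, size_t count)
{
    if (node->suffix_index >= 0) {
        if (count < max_out) out[count] = node->suffix_index;
        count++;
    }
    for (int c = 0; c < ST_ALPHABET; c++)
        if (node->child[c] != NULL)
            count = st_collect(node->child[c], out, max_out, count);
    return count;
}

/* Returns the number of occurrences of pat; positions go to out. */
size_t st_find_all(const struct STNode *root, const char *pat,
                   long *out, size_t max_out)
{
    assert(root != NULL && pat != NULL);
    const struct STNode *cur = root;
    for (; *pat != '\0'; pat++) {       /* Step 1: follow the pattern */
        unsigned char c = (unsigned char)*pat;
        if (c >= ST_ALPHABET || cur->child[c] == NULL) return 0;
        cur = cur->child[c];
    }
    return st_collect(cur, out, max_out, 0);
}

void st_free(struct STNode *node)
{
    if (node == NULL) return;
    for (int c = 0; c < ST_ALPHABET; c++) st_free(node->child[c]);
    free(node);
}
```

For the text "Panamabananas$" and the pattern "ana", the walk in step 1 succeeds and step 2 collects the suffix indices 1, 7 and 9, which are the three places where "ana" starts in the text.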
Suffix Arrays
A suffix array is a sorted array of all suffixes of a given string,
usually stored as the starting indices of the suffixes.
A suffix array can be constructed from a suffix tree by doing a DFS
traversal of the suffix tree that visits edges in lexicographic order.
A suffix array and a suffix tree can each be constructed from the other
in linear time.
Example
Let the given string be "banana".
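A direct way to see the definition at work is to sort the suffix start positions by comparing the suffixes themselves. The sketch below does this with `qsort` and `strcmp`. It is a naive construction that can take O(n^2 log n) time, not one of the linear-time methods, and the names `build_suffix_array`, `sa_cmp` and `sa_text` are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Text currently being sorted; read by the comparator. Using a
   file-scope variable makes this sketch non-reentrant. */
static const char *sa_text;

static int sa_cmp(const void *a, const void *b)
{
    size_t i = *(const size_t *)a, j = *(const size_t *)b;
    return strcmp(sa_text + i, sa_text + j);
}

/* Fills sa[0..n-1] with the start indices of the suffixes of text in
   lexicographic order; n must equal strlen(text). */
void build_suffix_array(const char *text, size_t *sa, size_t n)
{
    assert(text != NULL && sa != NULL && n == strlen(text));
    for (size_t i = 0; i < n; i++) sa[i] = i;
    sa_text = text;
    qsort(sa, n, sizeof *sa, sa_cmp);
}
```

For "banana" the sorted suffixes are a, ana, anana, banana, na, nana, so the resulting suffix array is {5, 3, 1, 0, 4, 2}.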
The following are some well-known problems where a suffix array can be used.
1) Pattern Searching
2) Finding the longest repeated substring
3) Finding the longest common substring
4) Finding the longest palindrome in a string
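For the first problem in this list, pattern searching, the sorted order allows a binary search: every occurrence of the pattern is a prefix of some suffix, and those suffixes are adjacent in the array. The sketch below assumes the caller supplies a valid suffix array `sa` for `text`; the function name `sa_search` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns the position in text of one occurrence of pat, or -1 if
   there is none. sa must hold the n suffix start indices of text in
   lexicographic order. */
long sa_search(const char *text, const size_t *sa, size_t n, const char *pat)
{
    assert(text != NULL && sa != NULL && pat != NULL);
    size_t m = strlen(pat);
    assert(m > 0);
    size_t lo = 0, hi = n;              /* search the range [lo, hi) */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        /* Compare pat with the first m characters of the suffix. */
        int c = strncmp(pat, text + sa[mid], m);
        if (c == 0) return (long)sa[mid];
        if (c < 0) hi = mid;            /* pat sorts before this suffix */
        else lo = mid + 1;              /* pat sorts after this suffix */
    }
    return -1;
}
```

With text "banana" and suffix array {5, 3, 1, 0, 4, 2}, searching for "nan" returns 2. Each comparison costs at most m character steps, so one lookup takes O(m log n) time.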