Module 06. String Algorithms Lecture 3-6

Uploaded by

Clash Of Clanes
Copyright: © All Rights Reserved
Module 06: String Algorithms

(Contd.)
Knuth Morris Pratt Algorithm
• The KMP matching algorithm exploits the degenerating property of the
pattern (the same sub-patterns appearing more than once in the pattern)
and improves the worst-case complexity to O(n + m), where n is the
length of the text and m is the length of the pattern.
• The basic idea behind KMP’s algorithm is: whenever we detect a
mismatch (after some matches), we already know some of the
characters in the text of the next window. We take advantage of this
information to avoid re-matching characters that we already know
will match.
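Matching Overview
txt = "AAAAABAAABA"
pat = "AAAA"

We compare the first window of txt with pat:

txt = "AAAAABAAABA"
pat = "AAAA" [Initial position]

The first window matches. After the pattern is shifted one position, the
first three characters of the new window are already known to be "AAA",
so only the fourth character of pat has to be compared with the text:

txt = "AAAAABAAABA"
pat =  "AAAA" [Pattern shifted one position]

txt = "AAAAABAAABA"
pat =   "AAAA" [Pattern shifted one position]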
• The KMP algorithm preprocesses pat[] and constructs an auxiliary array lps[] of size m
(the same as the size of the pattern), which is used to skip characters while matching.
• The name lps stands for "longest proper prefix which is also a suffix".
• A proper prefix is a prefix other than the whole string.
• For example, the prefixes of “ABC” are “”, “A”, “AB” and “ABC”.
• The proper prefixes are “”, “A” and “AB”.
• The suffixes of the string are “”, “C”, “BC” and “ABC”.
• We search for lps in sub-patterns.
• For each sub-pattern pat[0..i], where i = 0 to m-1, lps[i] stores the length of the
longest proper prefix of pat[0..i] that is also a suffix of pat[0..i].
• For example, for pat = “AAAA”, lps[] is [0, 1, 2, 3]; for pat = “AABAACAABAA”,
lps[] is [0, 1, 0, 1, 2, 0, 1, 2, 3, 4, 5].
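The lps[] definition can be turned into a linear-time routine. The following is a minimal sketch, not taken from the slides; the function name computeLps and the use of std::vector are assumptions made for illustration.

```cpp
#include <string>
#include <vector>

// lps[i] = length of the longest proper prefix of pat[0..i]
// that is also a suffix of pat[0..i].
std::vector<int> computeLps(const std::string &pat) {
    std::vector<int> lps(pat.size(), 0);
    int len = 0;  // length of the prefix matched so far
    for (std::size_t i = 1; i < pat.size();) {
        if (pat[i] == pat[len]) {
            lps[i++] = ++len;        // extend the current prefix/suffix match
        } else if (len > 0) {
            len = lps[len - 1];      // fall back to a shorter candidate, keep i
        } else {
            lps[i++] = 0;            // no proper prefix is also a suffix here
        }
    }
    return lps;
}
```

For example, computeLps("AAAA") yields {0, 1, 2, 3}.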
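Algorithm: Use a value from lps[] to decide the next characters to be
matched. The idea is to not match a character that we know will
match anyway. On a mismatch at pat[j] with j > 0, set j = lps[j-1] and
continue from the same text position; if j = 0, move to the next text
character.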
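The matching step can be sketched as follows. This is an illustrative implementation under the assumption that all match positions are wanted; the names kmpSearch and computeLps are hypothetical and do not come from the slides.

```cpp
#include <string>
#include <vector>

// Prefix table: lps[i] is the length of the longest proper prefix of
// pat[0..i] that is also a suffix of pat[0..i].
std::vector<int> computeLps(const std::string &pat) {
    std::vector<int> lps(pat.size(), 0);
    int len = 0;
    for (std::size_t i = 1; i < pat.size();) {
        if (pat[i] == pat[len]) lps[i++] = ++len;
        else if (len > 0) len = lps[len - 1];
        else lps[i++] = 0;
    }
    return lps;
}

// Returns the starting index of every occurrence of pat in txt.
std::vector<int> kmpSearch(const std::string &txt, const std::string &pat) {
    std::vector<int> matches;
    if (pat.empty() || pat.size() > txt.size()) return matches;
    const std::vector<int> lps = computeLps(pat);
    std::size_t i = 0;  // index into txt (never moves backwards)
    std::size_t j = 0;  // index into pat
    while (i < txt.size()) {
        if (txt[i] == pat[j]) {
            ++i;
            ++j;
            if (j == pat.size()) {
                matches.push_back(static_cast<int>(i - j));
                j = lps[j - 1];  // reuse the matched border for the next window
            }
        } else if (j > 0) {
            j = lps[j - 1];      // skip characters that are known to match
        } else {
            ++i;
        }
    }
    return matches;
}
```

With txt = "AAAAABAAABA" and pat = "AAAA", kmpSearch reports matches at positions 0 and 1.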
Hashing based approach (Rabin Karp) for pattern matching
Rabin-Karp
• The Rabin-Karp string searching algorithm calculates a hash value for the pattern,
and for each M-character substring of the text to be compared.
• If the hash values are unequal, the algorithm moves on and calculates the hash value
for the next M-character substring.
• If the hash values are equal, the algorithm does a brute-force comparison
between the pattern and the M-character substring.
• In this way, there is only one hash comparison per text substring, and the
character-by-character brute-force comparison is needed only when the hash values match.

Rabin-Karp Example
• Hash value of “AAAAA” is 37
• Hash value of “AAAAH” is 100

Rabin-Karp Algorithm
pattern is M characters long
hash_p = hash value of pattern
hash_t = hash value of first M letters in body of text
do
    if (hash_p == hash_t)
        brute force comparison of pattern and selected section of text
    hash_t = hash value of next section of text, one character over
while (not end of text)

Hash Function
• Let b be the number of letters in the alphabet. The text substring t[i .. i+M-1] is
mapped to the number
x(i) = t[i]*b^(M-1) + t[i+1]*b^(M-2) + … + t[i+M-1]
• Furthermore, given x(i) we can compute x(i+1) for the next
substring t[i+1 .. i+M] in constant time, as follows:
x(i+1) = x(i)*b - t[i]*b^M + t[i+M]
• In this way, we never recompute the value from scratch. We
simply adjust the existing value as we move over one
character.
Rabin-Karp Math Example
• Let’s say that our alphabet consists of 10 letters.
• our alphabet = a, b, c, d, e, f, g, h, i, j
• Let’s say that “a” corresponds to 1, “b” corresponds to 2 and so
on.
The hash value for string “cah” would be ...

3*100 + 1*10 + 8*1 = 318
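The arithmetic above can be checked with a small sketch. It assumes the same 10-letter alphabet (a = 1, …, j = 10) and uses the constant-time update x(i+1) = x(i)*b - t[i]*b^M + t[i+M]; the function names letterValue, directHash and rollHash are hypothetical.

```cpp
#include <string>

// Maps 'a'..'j' to 1..10, as in the example alphabet.
long long letterValue(char c) { return c - 'a' + 1; }

// Direct evaluation of s[0]*b^(M-1) + ... + s[M-1] using Horner's rule.
long long directHash(const std::string &s, long long b) {
    long long x = 0;
    for (char c : s) x = x * b + letterValue(c);
    return x;
}

// Constant-time update: shift left one digit, subtract the outgoing
// leftmost digit (scaled by b^M), add the incoming rightmost digit.
// bPowM must equal b^M, where M is the window length.
long long rollHash(long long x, char outgoing, char incoming,
                   long long b, long long bPowM) {
    return x * b - letterValue(outgoing) * bPowM + letterValue(incoming);
}
```

For instance, directHash("cah", 10) is 318, and rolling the window from "cah" to "ahd" with rollHash(318, 'c', 'd', 10, 1000) gives 3180 - 3000 + 4 = 184, the same value as directHash("ahd", 10).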

Rabin-Karp Mods
• If M is large, then the resulting value (~b^M) will be enormous. For this reason, we
hash the value by taking it mod a prime number q.
• The mod function is particularly useful in this case due to several of its inherent
properties:
[(x mod q) + (y mod q)] mod q = (x+y) mod q
[(x mod q) * (y mod q)] mod q = (x*y) mod q
(x mod q) mod q = x mod q
• For these reasons:
h(i) = ((t[i]*b^(M-1) mod q) + (t[i+1]*b^(M-2) mod q) + … + (t[i+M-1] mod q)) mod q
h(i+1) = ( h(i)*b mod q        [shift left one digit]
         - t[i]*b^M mod q      [subtract leftmost digit]
         + t[i+M] mod q )      [add new rightmost digit]
         mod q
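A minimal sketch of the complete search with the modular rolling hash follows. The function name rabinKarp and the default values b = 256 and q = 1000000007 are assumptions chosen for illustration, not values from the slides. This variant subtracts the leftmost digit (scaled by b^(M-1)) before shifting, which is algebraically the same update, and adds q before reducing so the intermediate value never goes negative.

```cpp
#include <string>
#include <vector>

// Returns the starting index of every occurrence of pat in text.
std::vector<int> rabinKarp(const std::string &text, const std::string &pat,
                           long long b = 256, long long q = 1000000007LL) {
    std::vector<int> matches;
    const std::size_t n = text.size(), m = pat.size();
    if (m == 0 || m > n) return matches;

    long long high = 1;  // b^(M-1) mod q, the weight of the leftmost digit
    for (std::size_t k = 1; k < m; ++k) high = high * b % q;

    long long hashP = 0, hashT = 0;  // hashes of pattern and first window
    for (std::size_t k = 0; k < m; ++k) {
        hashP = (hashP * b + static_cast<unsigned char>(pat[k])) % q;
        hashT = (hashT * b + static_cast<unsigned char>(text[k])) % q;
    }

    for (std::size_t i = 0;; ++i) {
        // Brute-force comparison only when the hash values agree.
        if (hashP == hashT && text.compare(i, m, pat) == 0)
            matches.push_back(static_cast<int>(i));
        if (i + m >= n) break;
        // Subtract the leftmost digit (adding q keeps the value non-negative),
        // shift left one digit, then add the new rightmost digit.
        hashT = (hashT - static_cast<unsigned char>(text[i]) * high % q + q) % q;
        hashT = (hashT * b + static_cast<unsigned char>(text[i + m])) % q;
    }
    return matches;
}
```

Passing a deliberately small prime such as q = 13 still gives correct results, because every hash match is verified, but it causes more spurious hash matches and therefore more brute-force comparisons.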
Rabin-Karp Complexity
• If a sufficiently large prime number is used for the hash function,
the hashed values of two different patterns will usually be distinct.
• If this is the case, searching takes O(N + M) expected time, where N is the
number of characters in the larger body of text and M is the length of
the pattern.
• It is always possible to construct a scenario with a worst-case
complexity of O(MN), because every hash match triggers a brute-force
comparison of M characters. Frequent spurious hash matches, however,
are likely only if the prime number used for hashing is small.

Data structures (Tries and compressed tries) for strings
Tries
• A trie is an efficient information retrieval data structure (the name
comes from reTRIEval).
• Tries can reduce search complexity to the optimal limit (the key length).
• A balanced binary search tree needs time proportional to M * log N, where M is
the maximum string length and N is the number of keys in the tree.
• Using a trie, we can search for a key in O(M) time, but the space
requirement of a trie can be its limitation.
• All the descendants of a node in a trie share the same prefix, hence tries
are also known as prefix trees.
Trie Node
// Trie node
#define ALPHABET_SIZE 26 // assumption: keys use only 'a'..'z'
struct TrieNode
{
    struct TrieNode *children[ALPHABET_SIZE];
    // isEndOfWord is true if the node
    // represents end of a word
    bool isEndOfWord;
};

• Every node of a trie consists of multiple branches.
• Each branch represents a possible character of keys.
• The last node of every key is marked as an end-of-word node.
Insertion in Tries
• Every character of input key is inserted as an individual Trie node.
• children is an array of pointers/references to next level trie nodes.
• The key character acts as an index into the array children
• If the input key is new or an extension of existing key, construct non-
existing nodes of the key, and mark end of word for last node.
• If the input key is prefix of existing key in Trie, we simply mark the last
node of key as end of word.
• The key length determines Trie depth.
Example:

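// Insert key into the trie rooted at root (assumes lowercase 'a'..'z';
// needs <string> for std::string).
void insert(TrieNode *root, const std::string &key) {
    TrieNode *node = root;
    for (char c : key) {
        int index = c - 'a';
        if (node->children[index] == nullptr)
            node->children[index] = new TrieNode();  // zero-initialized node
        node = node->children[index];
    }
    node->isEndOfWord = true;  // mark the last node as end of word
}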
Standard trie for {bear, bell, bid, bull, buy, sell, stock, stop}
Compressed trie: obtained from the standard trie by merging each chain
of nodes that have only one child into a single edge.
Searching in Tries
• Searching for a key is similar to the insert operation: compare the
characters and move down.
• The search can terminate because the end of the key string is reached or
because a required child node is missing from the trie.
• In the first case, if the isEndOfWord field of the last node is true, then
the key exists in the trie.
• In the second case, the search terminates without examining all the
characters of the key, since the key is not present in the trie.
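Insertion and search can be put together in a short sketch. It assumes keys made only of lowercase letters 'a'..'z' and, for brevity, never frees the allocated nodes; the default member initializers are an addition for this illustration.

```cpp
#include <string>

const int ALPHABET_SIZE = 26;  // assumption: keys use only 'a'..'z'

struct TrieNode {
    TrieNode *children[ALPHABET_SIZE] = {};  // all branches start out null
    bool isEndOfWord = false;                // true if a key ends at this node
};

// Creates the missing nodes along the key and marks the last one.
void insert(TrieNode *root, const std::string &key) {
    TrieNode *node = root;
    for (char c : key) {
        int index = c - 'a';
        if (node->children[index] == nullptr)
            node->children[index] = new TrieNode();
        node = node->children[index];
    }
    node->isEndOfWord = true;
}

// Follows the key character by character; fails as soon as a branch is
// missing, otherwise succeeds only if the last node is an end-of-word node.
bool search(const TrieNode *root, const std::string &key) {
    const TrieNode *node = root;
    for (char c : key) {
        node = node->children[c - 'a'];
        if (node == nullptr) return false;
    }
    return node->isEndOfWord;
}
```

After inserting "bear" and "bell", search finds "bell" but not "be" (a prefix that was never marked as a key) and not "bulk" (a missing branch).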
Lecture 36-Suffix tree and Suffix array data structures and related
operation for string handling
Suffix Tree
• A Suffix Tree for a given text is a compressed trie for all suffixes of
the given text.
Example:
• Given words: {bear, bell, bid, bull, buy, sell, stock, stop}
Build a Suffix Tree for a given text
• 1) Generate all suffixes of the given text.
• 2) Consider all suffixes as individual words and build a compressed
trie.
• Example text: “banana\0”
• Following are all suffixes of “banana\0”:
• banana\0
• anana\0
• nana\0
• ana\0
• na\0
• a\0
• \0
Search in Suffix Tree
Example:
• Panamabananas$
Possible suffixes:
index suffix
0 Panamabananas$
1 anamabananas$
2 namabananas$
3 amabananas$
4 mabananas$
5 abananas$
6 bananas$
7 ananas$
8 nanas$
9 anas$
10 nas$
11 as$
12 s$
13 $
Pattern searching in suffix tree
• Step 1: Check whether the given pattern exists in the string by
traversing the suffix tree from the root along the characters of the pattern.
• Step 2: If the pattern is found in the suffix tree, traverse the subtree
below that point and collect all suffix indices on the leaf nodes. Those
suffix indices are the positions of the pattern in the string.
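The two steps can be illustrated with a simplified sketch. It builds an uncompressed suffix trie (one node per character, O(n^2) time and space) rather than a true compressed suffix tree, so it is suitable only for short texts. The names SuffixTrieNode, buildSuffixTrie, collect and findAll are hypothetical, and the text is assumed not to contain the terminator '$'.

```cpp
#include <algorithm>
#include <map>
#include <memory>
#include <string>
#include <vector>

struct SuffixTrieNode {
    std::map<char, std::unique_ptr<SuffixTrieNode>> children;
    int suffixIndex = -1;  // start of the suffix that ends here, or -1
};

// Inserts every suffix of text + '$' into an uncompressed trie.
std::unique_ptr<SuffixTrieNode> buildSuffixTrie(const std::string &text) {
    const std::string s = text + '$';
    std::unique_ptr<SuffixTrieNode> root(new SuffixTrieNode());
    for (std::size_t start = 0; start < s.size(); ++start) {
        SuffixTrieNode *node = root.get();
        for (std::size_t k = start; k < s.size(); ++k) {
            std::unique_ptr<SuffixTrieNode> &child = node->children[s[k]];
            if (!child) child.reset(new SuffixTrieNode());
            node = child.get();
        }
        node->suffixIndex = static_cast<int>(start);  // leaf for this suffix
    }
    return root;
}

// Step 2: gather the suffix indices of all leaves below node.
void collect(const SuffixTrieNode *node, std::vector<int> &out) {
    if (node->suffixIndex >= 0) out.push_back(node->suffixIndex);
    for (const auto &kv : node->children) collect(kv.second.get(), out);
}

// Step 1: walk down along pat; if the walk succeeds, run step 2.
std::vector<int> findAll(const SuffixTrieNode *root, const std::string &pat) {
    const SuffixTrieNode *node = root;
    for (char c : pat) {
        auto it = node->children.find(c);
        if (it == node->children.end()) return {};  // pattern not present
        node = it->second.get();
    }
    std::vector<int> result;
    collect(node, result);
    std::sort(result.begin(), result.end());
    return result;
}
```

For the text "panamabananas", findAll with the pattern "ana" returns the positions 1, 7 and 9. A production suffix tree would instead compress single-child chains and be built in linear time.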
Suffix Arrays
A suffix array is a sorted array of all suffixes of a given string. In
practice it stores the starting index of each suffix rather than the
suffix itself.
A suffix array can be constructed from a suffix tree by doing a DFS
traversal of the suffix tree.
A suffix array and a suffix tree can each be constructed from the other in
linear time.
Example
Let the given string be "banana".

<suffixes>                           <sorted suffixes>
0 banana                             5 a
1 anana     Sort the suffixes        3 ana
2 nana      ------------------->     1 anana
3 ana       alphabetically           0 banana
4 na                                 4 na
5 a                                  2 nana

• So the suffix array for "banana" is {5, 3, 1, 0, 4, 2}
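The example can be reproduced with a naive construction that simply sorts the suffix start positions. This is a sketch for illustration (O(n^2 log n) in the worst case), not one of the linear-time constructions; the function name buildSuffixArray is an assumption.

```cpp
#include <algorithm>
#include <numeric>
#include <string>
#include <vector>

// Returns the start positions of all suffixes of s in lexicographic order.
std::vector<int> buildSuffixArray(const std::string &s) {
    std::vector<int> sa(s.size());
    std::iota(sa.begin(), sa.end(), 0);  // 0, 1, ..., n-1
    std::sort(sa.begin(), sa.end(), [&s](int a, int b) {
        // Compare the suffix starting at a with the suffix starting at b.
        return s.compare(a, std::string::npos, s, b, std::string::npos) < 0;
    });
    return sa;
}
```

Calling buildSuffixArray("banana") yields {5, 3, 1, 0, 4, 2}.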


Search a pattern using Suffix Array
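• Preprocess the text and build a suffix array of the text.
• Because the suffixes are sorted, binary search can be applied to find the
given pattern: with a text of length n and a pattern of length m, each
search takes O(m log n) time.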
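A sketch of the binary search follows. It uses std::lower_bound and std::upper_bound to find the block of suffixes that start with the pattern; the names buildSuffixArray and searchSuffixArray are hypothetical, and the suffix array is built with a naive sort for brevity.

```cpp
#include <algorithm>
#include <numeric>
#include <string>
#include <vector>

// Naive suffix array: suffix start positions in lexicographic order.
std::vector<int> buildSuffixArray(const std::string &s) {
    std::vector<int> sa(s.size());
    std::iota(sa.begin(), sa.end(), 0);
    std::sort(sa.begin(), sa.end(), [&s](int a, int b) {
        return s.compare(a, std::string::npos, s, b, std::string::npos) < 0;
    });
    return sa;
}

// Returns, in increasing order, every position where pat occurs in text.
std::vector<int> searchSuffixArray(const std::string &text,
                                   const std::vector<int> &sa,
                                   const std::string &pat) {
    // First suffix whose first |pat| characters are not smaller than pat.
    auto lo = std::lower_bound(sa.begin(), sa.end(), pat,
        [&text](int suffix, const std::string &p) {
            return text.compare(suffix, p.size(), p) < 0;
        });
    // First suffix whose first |pat| characters are greater than pat.
    auto hi = std::upper_bound(lo, sa.end(), pat,
        [&text](const std::string &p, int suffix) {
            return text.compare(suffix, p.size(), p) > 0;
        });
    std::vector<int> positions(lo, hi);  // suffixes that start with pat
    std::sort(positions.begin(), positions.end());
    return positions;
}
```

For text "banana", searching for "ana" returns the positions 1 and 3.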
Applications of Suffix Array

Following are some famous problems where Suffix array can be used.
1) Pattern Searching
2) Finding the longest repeated substring
3) Finding the longest common substring
4) Finding the longest palindrome in a string
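As one illustration of these applications, the longest repeated substring can be read off a suffix array: any repeated substring is a common prefix of two suffixes, and the longest such prefix always occurs between two suffixes that are adjacent in sorted order. The sketch below assumes a naive suffix-array construction, allows the two occurrences to overlap, and uses the hypothetical name longestRepeatedSubstring.

```cpp
#include <algorithm>
#include <numeric>
#include <string>
#include <vector>

// Returns a longest substring that occurs at least twice in s
// (occurrences may overlap); returns "" if no character repeats.
std::string longestRepeatedSubstring(const std::string &s) {
    // Naive suffix array: sort the suffix start positions.
    std::vector<int> sa(s.size());
    std::iota(sa.begin(), sa.end(), 0);
    std::sort(sa.begin(), sa.end(), [&s](int a, int b) {
        return s.compare(a, std::string::npos, s, b, std::string::npos) < 0;
    });

    std::size_t bestLen = 0, bestPos = 0;
    for (std::size_t k = 1; k < sa.size(); ++k) {
        // Length of the common prefix of two adjacent sorted suffixes.
        std::size_t a = sa[k - 1], b = sa[k], len = 0;
        while (a + len < s.size() && b + len < s.size() &&
               s[a + len] == s[b + len]) {
            ++len;
        }
        if (len > bestLen) {
            bestLen = len;
            bestPos = a;
        }
    }
    return s.substr(bestPos, bestLen);
}
```

For "banana" the result is "ana", which occurs at positions 1 and 3.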
