
Advanced String Lecture

The document discusses string searching algorithms, namely the Rabin-Karp and KMP algorithms. The Rabin-Karp algorithm uses hashing to quickly filter out text positions that cannot match the pattern, computing a rolling hash over a moving window. The KMP algorithm uses a prefix table to skip character comparisons after a mismatch, leveraging the fact that part of the text is already known to match a prefix of the pattern. It first builds the prefix table in a preprocessing step and then uses it to search for the pattern in the text efficiently.

Uploaded by

Yared Tegegn

String Searching Algorithms


How would you normally try to check if there is a substring that matches the
pattern?

String: “abcdeacdoe”

Pattern: “bcd”
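A minimal sketch of the "normal" approach the question is pointing at: slide a window over the string and compare it with the pattern at every position. The function name `naive_search` is only illustrative and is not part of the lecture.

```python
def naive_search(text, pattern):
    """Return the index of the first occurrence of pattern in text, or -1."""
    n, m = len(text), len(pattern)
    # Try every possible starting position of the window.
    for start in range(n - m + 1):
        # Compare the whole window with the pattern (up to m character checks).
        if text[start:start + m] == pattern:
            return start
    return -1


print(naive_search("abcdeacdoe", "bcd"))  # 1
```

Each window costs up to m character comparisons, so the whole scan can take on the order of n * m steps. The algorithms in the rest of the lecture try to avoid that repeated work.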
Rabin-Karp Algorithm
● AKA Karp–Rabin algorithm
● Uses hashing (more specifically Rolling Hash)
● Helps quickly filter out positions of the text that cannot match the pattern, and
then checks for a match at the remaining positions
What is Rolling hash?
● A rolling hash is a hash function where the input is hashed in a window that
moves through the input.
● Think of it as a wheel moving on an inclined plane
Steps followed in Rolling hash pattern matching
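1. Calculate the hash for the pattern
2. Calculate the hash for the current window of the string (starting with the 1st window) and compare it with the pattern hash
3. Slide the window by one character and repeat step 2 until we get to the end of the string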
Trivial Rolling Hash Drawbacks?
● What are the drawbacks of a trivial rolling hash?
Let’s consider the case

String: AABAABCABA
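One possible reading of this slide, sketched under a loud assumption: the slides do not define the "trivial" rolling hash, so the code below assumes it is simply the sum of the character values (A => 1, B => 2, ...). The window length of 3 and the name `additive_hash` are also assumptions made for illustration.

```python
def additive_hash(s):
    """Assumed 'trivial' hash: the sum of character values, A => 1, B => 2, ..."""
    return sum(ord(ch) - ord("A") + 1 for ch in s)


text = "AABAABCABA"
window = 3  # assumed window length, not given on the slide

# Group the distinct windows of the text by their hash value.
hashes = {}
for start in range(len(text) - window + 1):
    chunk = text[start:start + window]
    hashes.setdefault(additive_hash(chunk), set()).add(chunk)

for value in sorted(hashes):
    print(value, sorted(hashes[value]))
# 4 ['AAB', 'ABA', 'BAA']
# 6 ['ABC', 'BCA', 'CAB']
```

Under this assumption, any rearrangement of the same characters gets the same hash, so many windows pass the hash filter without being real matches. That would be the drawback a position-weighted hash is meant to reduce.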
Rabin–Karp string search algorithm
This is a simple rolling hash function that uses only multiplications and additions:

H = c1*a^0 + c2*a^1 + ... + ck*a^(k-1)

where a is a constant, and c1, c2,...,ck are the values of the input characters

● We then return H(P) = H mod d

d is preferably a large prime number. Why?


Consider the string ABEDA

Character Values: A => 1, B => 2, …. Z => 26

Let’s choose the prime number 3 as the constant a

1st Window: ABE
[A B E] D A
1*3^0 + 2*3^1 + 5*3^2
1 + 6 + 45 = 52 (hash value)

2nd Window: BED
A [B E D] A
2*3^0 + 5*3^1 + 4*3^2
2 + 15 + 36 = 53 (hash value)

● There is a method of making this process of finding the hash more efficient!
Steps to follow in Rabin-Karp’s algorithm
1. Subtract the value of the character leaving the window from the old hash
2. Divide the result by the number we picked (the base)
3. Add the value of the character entering the window, multiplied by the base raised to the power (length of the pattern - 1)
● So in the previous example:
● 52 - val(A) = 52 - 1 = 51
● 51 / 3 = 17
● 17 + val(D) * 3^2 = 17 + 36 = 53
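The update can be checked with a short sketch using the same conventions as the ABEDA example (A => 1, ..., Z => 26, base 3, first character weighted by 3^0). The names `val`, `window_hash` and `roll` are illustrative only. No modulus is applied here, so the division is exact. Combining this ordering with "mod d" would need a modular inverse instead of plain division.

```python
BASE = 3  # the number picked in the example


def val(ch):
    """Character value used on the slides: A => 1, B => 2, ..., Z => 26."""
    return ord(ch) - ord("A") + 1


def window_hash(window):
    """Hash a window from scratch: c1*BASE^0 + c2*BASE^1 + ..."""
    return sum(val(ch) * BASE ** k for k, ch in enumerate(window))


def roll(old_hash, outgoing, incoming, length):
    """Slide the window one step using the three update steps."""
    h = old_hash - val(outgoing)                     # 1. drop the leaving character
    h //= BASE                                       # 2. exact: every remaining term has a factor BASE
    return h + val(incoming) * BASE ** (length - 1)  # 3. add the entering character


first = window_hash("ABE")
second = roll(first, "A", "D", 3)
third = roll(second, "B", "A", 3)
print(first, second, third)  # 52 53 26
```

Each roll costs a constant number of arithmetic operations, instead of rehashing all characters of the window.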
Practice:
String: “AABAACAADAABAABA”

Pattern = “AABA”

A A B A A C A A D A A B A A B A
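For checking the practice by machine, here is a hedged sketch of a complete Rabin-Karp search. It deliberately swaps in the more common ordering in which the first character of the window carries the highest power, so that the update needs no division and works together with "mod d". The base 256, the modulus 1,000,000,007 and the name `rabin_karp` are assumptions, not values from the slides.

```python
def rabin_karp(text, pattern, base=256, mod=1_000_000_007):
    """Return every index where pattern occurs in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    high = pow(base, m - 1, mod)  # weight of the character leaving the window
    p_hash = w_hash = 0
    for k in range(m):
        p_hash = (p_hash * base + ord(pattern[k])) % mod
        w_hash = (w_hash * base + ord(text[k])) % mod
    matches = []
    for start in range(n - m + 1):
        # The hash only filters; an equal hash is confirmed character by character.
        if w_hash == p_hash and text[start:start + m] == pattern:
            matches.append(start)
        if start + m < n:
            # Remove text[start], shift the rest by one power, append text[start + m].
            w_hash = ((w_hash - ord(text[start]) * high) * base
                      + ord(text[start + m])) % mod
    return matches


print(rabin_karp("AABAACAADAABAABA", "AABA"))  # [0, 9, 12]
```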
Time Complexity
● Best-case? Worst-case?
● When is worst-case time complexity achieved?
● How to reduce worst-case scenarios?
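One way to explore these questions is to count how many character comparisons the verification step performs. The sketch below is an illustration under assumptions (base 256, modulus 1,000,000,007, first character weighted highest, hypothetical function name), not the lecture's own code. When every window has the same hash as the pattern, every window is verified in full, which is the roughly n * m worst case. When no window passes the hash filter, only the hashing work of about n + m steps remains.

```python
def count_verification_comparisons(text, pattern, base=256, mod=1_000_000_007):
    """Run a Rabin-Karp scan and count character comparisons made while verifying."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return 0
    high = pow(base, m - 1, mod)
    p_hash = w_hash = 0
    for k in range(m):
        p_hash = (p_hash * base + ord(pattern[k])) % mod
        w_hash = (w_hash * base + ord(text[k])) % mod
    comparisons = 0
    for start in range(n - m + 1):
        if w_hash == p_hash:
            # Hash hit: verify the window one character at a time.
            for k in range(m):
                comparisons += 1
                if text[start + k] != pattern[k]:
                    break
        if start + m < n:
            w_hash = ((w_hash - ord(text[start]) * high) * base
                      + ord(text[start + m])) % mod
    return comparisons


print(count_verification_comparisons("A" * 20, "A" * 5))   # 80 (16 windows * 5 characters)
print(count_verification_comparisons("AB" * 10, "A" * 5))  # 0 (no window passes the hash filter)
```

A larger prime modulus makes accidental hash hits (collisions) rarer, but it cannot help when the windows really do match, as in the first call.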
KMP Pattern Matching
- Knuth-Morris-Pratt Algorithm
- The basic idea behind KMP’s algorithm is: whenever we detect a mismatch
(after some matches), we already know some of the characters in the text of
the next window
- The KMP algorithm was the first linear time complexity algorithm for string
matching.
● The KMP algorithm is used to find a "Pattern" in a "Text". It compares
character by character from left to right, but whenever a mismatch occurs, it
uses a preprocessed table called the "Prefix Table" to skip character comparisons
while matching.
● The prefix table is sometimes also known as the LPS Table. Here LPS stands for
"Longest proper Prefix which is also a Suffix".
Steps for creating LPS Table (Preprocessing)
1. Define a one-dimensional array with size equal to the length of the
Pattern. (LPS[size])
2. Define variables i & j. Set j = 0, i = 1 and LPS[0] = 0.
3. Compare the characters at Pattern[i] and Pattern[j].
4. If they match, set LPS[i] = j+1 and increment both i & j by
one. Go to Step 3.
5. If they do not match, check the value of variable 'j'. If it is '0', set
LPS[i] = 0 and increment 'i' by one; if it is not '0', set j = LPS[j-1]. Go
to Step 3.
6. Repeat the above steps until all the values of LPS[] are filled.
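The preprocessing can be sketched in Python as follows. This is a hedged reading of the steps in which j is the pointer that starts at index 0 (the end of the current prefix) and i is the pointer that scans the pattern from index 1, and in which a mismatch with j > 0 falls back via j = LPS[j-1]. The helper `lps_by_definition` is an illustrative slow cross-check and not part of the lecture.

```python
def build_lps(pattern):
    """Build the LPS (prefix) table with the two-pointer procedure."""
    lps = [0] * len(pattern)          # LPS[0] is always 0
    j, i = 0, 1                       # j marks the prefix, i scans the pattern
    while i < len(pattern):
        if pattern[i] == pattern[j]:  # match: the prefix grows by one
            lps[i] = j + 1
            i += 1
            j += 1
        elif j == 0:                  # mismatch with nothing matched
            lps[i] = 0
            i += 1
        else:                         # mismatch: fall back to a shorter prefix
            j = lps[j - 1]
    return lps


def lps_by_definition(pattern):
    """Slow cross-check: apply the LPS definition to every prefix directly."""
    table = []
    for end in range(1, len(pattern) + 1):
        prefix = pattern[:end]
        best = 0
        for length in range(1, end):  # proper prefixes only
            if prefix[:length] == prefix[-length:]:
                best = length
        table.append(best)
    return table


print(build_lps("abcdabca"))                                   # [0, 0, 0, 0, 1, 2, 3, 1]
print(build_lps("abcdabca") == lps_by_definition("abcdabca"))  # True
```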
Example

String: abcdabca

Step 1 - Define a one dimensional array

a b c d a b c a
0 1 2 3 4 5 6 7
Step 2:

j i
a b c d a b c a
0 1 2 3 4 5 6 7
0

● j = 0, i = 1
● First one is always 0


Remaining Steps until i == len(pat):

j i
a b c d a b c a
0 1 2 3 4 5 6 7
0 0

● pat[i] != pat[j] and j == 0, so LPS[i] = 0
● i += 1

j   i
a b c d a b c a
0 1 2 3 4 5 6 7
0 0 0

j     i
a b c d a b c a
0 1 2 3 4 5 6 7
0 0 0 0

j       i
a b c d a b c a
0 1 2 3 4 5 6 7
0 0 0 0 1

● Now pat[i] == pat[j]
● LPS[i] = j + 1
● j, i += 1

  j       i
a b c d a b c a
0 1 2 3 4 5 6 7
0 0 0 0 1 2

    j       i
a b c d a b c a
0 1 2 3 4 5 6 7
0 0 0 0 1 2 3

      j       i
a b c d a b c a
0 1 2 3 4 5 6 7
0 0 0 0 1 2 3 ?

● What happens in this case?

j (moved here)
j             i
a b c d a b c a
0 1 2 3 4 5 6 7
0 0 0 0 1 2 3 1

● As mentioned in step 5, pat[i] != pat[j] and j is not 0
● So in this case we check the LPS[j-1] value (LPS[2] = 0) and j will go to this index
● Since pat[i] == pat[j] for the new j, LPS[i] = j + 1
● Therefore j = 0 and LPS[i] = 1, then j, i += 1

  j             i
a b c d a b c a
0 1 2 3 4 5 6 7
0 0 0 0 1 2 3 1

● Since i is now equal to size of pat, we stop


Practice making LPS
1. String = aabaabaaa
2. String = aaacaaaaac
● But this is only the first part (Pattern Preprocessing)
● Next the String Matching phase begins
String Matching Explained
1. Start by comparing the first character of the pattern with the first character of
the text, and keep advancing both while the characters match
2. If there is a mismatch at pattern index j, we check LPS[j-1] and compare the
pattern character at that index with the same character in the text (if j = 0,
we simply move to the next character in the text)
3. If the value is different from 0, this means there is a suffix that is also a prefix,
hence we don’t need to compare starting from the first character in the
pattern. This is the brilliance of the KMP algorithm.
4. We continue this process until we find a match, or we run out of text and
conclude we don’t have a match.
Text: abxabcabcaby

Pattern: abcaby

LPS of the pattern:


a b c a b y

0 1 2 3 4 5

0 0 0 1 2 0
i
a b x a b c a b c a b y
j
a b c a b y
0 1 2 3 4 5
0 0 0 1 2 0

  i
a b x a b c a b c a b y
  j
a b c a b y
0 1 2 3 4 5
0 0 0 1 2 0

    i
a b x a b c a b c a b y
    j
a b c a b y
0 1 2 3 4 5
0 0 0 1 2 0

● What happens here since there is a mismatch?

    i
a b x a b c a b c a b y
j
a b c a b y
0 1 2 3 4 5
0 0 0 1 2 0

● We check LPS[j-1], which is LPS[1] = 0
● This means the next comparison is going to start from the 0th index of the pattern

    i
a b x a b c a b c a b y
j
a b c a b y
0 1 2 3 4 5
0 0 0 1 2 0

● Since pattern[j] != text[i] and j == 0, i += 1


      i
a b x a b c a b c a b y
j
a b c a b y
0 1 2 3 4 5
0 0 0 1 2 0

        i
a b x a b c a b c a b y
  j
a b c a b y
0 1 2 3 4 5
0 0 0 1 2 0

          i
a b x a b c a b c a b y
    j
a b c a b y
0 1 2 3 4 5
0 0 0 1 2 0

            i
a b x a b c a b c a b y
      j
a b c a b y
0 1 2 3 4 5
0 0 0 1 2 0

              i
a b x a b c a b c a b y
        j
a b c a b y
0 1 2 3 4 5
0 0 0 1 2 0

                i
a b x a b c a b c a b y
          j
a b c a b y
0 1 2 3 4 5
0 0 0 1 2 0

● pattern[j] != text[i]
● What do we do next?

                i
a b x a b c a b c a b y
    j
a b c a b y
0 1 2 3 4 5
0 0 0 1 2 0

● LPS[j-1] = LPS[4] = 2. As a result j = 2.
● This is because “ab” is a prefix that is also a suffix. We don’t need to check from the
beginning.
● Since text[i] == pattern[j] for the new j, both i and j move to the next index.
● What if pattern[j] for the new j was different from text[i]?
                  i
a b x a b c a b c a b y
      j
a b c a b y
0 1 2 3 4 5
0 0 0 1 2 0

                    i
a b x a b c a b c a b y
        j
a b c a b y
0 1 2 3 4 5
0 0 0 1 2 0

                      i
a b x a b c a b c a b y
          j
a b c a b y
0 1 2 3 4 5
0 0 0 1 2 0

                        i
a b x a b c a b c a b y
            j
a b c a b y
0 1 2 3 4 5
0 0 0 1 2 0

● j has reached the end of the pattern
● We have found a match, starting at text index i - j = 6


Practice:

String: adsgwadsxdsgwadsgz

Pattern: dsgwadsgz
Questions
1. Longest Happy Prefix
2. Find the Index of the First Occurrence in a String
