DAA Unit 5 Part 1
String Matching
• A String Matching Algorithm is also called a "String Searching Algorithm."
• This is an important class of string algorithms, defined as methods for
finding the places where one or several strings (patterns) occur within
a larger string (the text).
• Given a text array T[1..n] of n characters and a pattern array P[1..m]
of m characters, the problem is to find an integer s, called a valid
shift, where 0 ≤ s ≤ n-m and T[s+1..s+m] = P[1..m].
• In other words, the goal is to find whether P occurs in T, i.e., whether
P is a substring of T. The elements of P and T are characters drawn from
some finite alphabet such as {0, 1} or {A, B, ..., Z, a, b, ..., z}.
• Given a string T[1..n], its substrings are written T[i..j] for some 1 ≤ i ≤ j ≤ n:
the string formed by the characters of T from index i to index j, inclusive. This implies that
a string is a substring of itself (take i = 1 and j = n).
Algorithms used for String Matching:
• Several methods are used for string matching:
• The Naive String Matching Algorithm
• The Rabin-Karp-Algorithm
• Finite Automata
• The Knuth-Morris-Pratt Algorithm
• The Boyer-Moore Algorithm
The Naive String Matching Algorithm
• The naive approach tests every possible placement of the pattern P[1..m] relative to the
text T[1..n]. We try the shifts s = 0, 1, ..., n-m in turn and, for each shift s, compare
T[s+1..s+m] to P[1..m].
• The naive algorithm finds all valid shifts using a loop that checks the condition
P[1..m] = T[s+1..s+m] for each of the n-m+1 possible values of s.
Naive String Matching
Example:
Suppose T = 1011101110 and P = 111.
Find all the valid shifts.
The valid shifts are s = 2 and s = 6, since P matches T[3..5] and T[7..9].
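The loop described above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name naive_match is an assumption, and Python's 0-based slice text[s:s + m] plays the role of T[s+1..s+m].

```python
def naive_match(text, pattern):
    """Return every valid shift s at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    shifts = []
    # Try each of the n - m + 1 candidate shifts in turn.
    for s in range(n - m + 1):
        # text[s:s + m] corresponds to T[s+1..s+m] in 1-based notation.
        if text[s:s + m] == pattern:
            shifts.append(s)
    return shifts


print(naive_match("1011101110", "111"))  # [2, 6]
```

Each shift costs up to m character comparisons, so the whole search can take on the order of (n-m+1)·m comparisons in the worst case.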
Rabin-Karp String Matching
• The Rabin-Karp algorithm is a string matching algorithm that efficiently finds patterns
within a larger text using a technique called hashing.
• The basic idea is to convert the pattern and each possible substring of the text into
numeric values (hashes) and then compare these values rather than the strings
themselves. This allows for faster comparisons, especially when dealing with large texts.
• The Rabin-Karp algorithm for string matching is useful because it can quickly find patterns
in large texts. It’s especially good when you need to search for multiple patterns at once
or when the text is very long.
Rabin-Karp String Matching - Working
• Hashing the Pattern: First, the algorithm turns the pattern (the word or sequence
you're looking for) into a number using a hash function. This number acts as a
compact fingerprint of the pattern.
• Hashing the Text: Next, the algorithm takes the first part of the text that is the
same length as the pattern and turns it into a number using the same hash
function.
• Comparing Hashes: The algorithm compares the number (hash) of the pattern
with the number (hash) of the part of the text. If the numbers are the same, it
means the pattern might be there. If the numbers are different, the pattern is
definitely not there.
• Sliding the Window: The algorithm then slides over to the next part of the text,
updates the hash for this part, and compares again. This sliding continues until
the whole text is checked.
• Collision Check: Sometimes, a part of the text can have the same hash number as
the pattern even though it is not the same as the pattern. When the hash numbers
match, the algorithm checks the actual text to make sure it really found the
pattern.
Rabin-Karp String Matching - Key Concepts
1. Hash Function
• A hash function is a mathematical function that converts a string (like a sequence of
characters) into a fixed-size numeric value, which is called a hash.
• The hash function takes a string as input and returns a number (hash) for that string. The same
string always gives the same hash, but two different strings can occasionally share one.
• By converting strings into numbers, the Rabin-Karp algorithm can compare the pattern with
parts of the text much faster. Instead of comparing the strings character by character, it just
compares their hash values. If the hash values match, then the strings are likely to match as
well.
2. Sliding Window Technique
• The sliding window technique is a method used by the Rabin-Karp algorithm to move
through the text efficiently while updating the hash value.
How It Works:
• Imagine you have a window that covers a substring of the text that is the same length as the
pattern. You start by calculating the hash for the first substring within this window. Then, you
slide the window one character to the right to cover the next substring.
• Instead of recalculating the hash from scratch for each new substring, the algorithm uses the
hash of the previous substring and updates it by removing the effect of the first character and
adding the effect of the new character.
• For example, if your current window covers "abc" (with a hash of 294), and you move the
window to cover "bcd", you can quickly update the hash by subtracting the effect of 'a' and
adding the effect of 'd'.
• This technique makes each hash update a constant-time operation: instead of recalculating
the hash for every substring from scratch, you only need to make a few adjustments.
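The "abc" to "bcd" update can be made concrete. The quoted hash of 294 is consistent with simply adding the character codes (97 + 98 + 99), so the sketch below assumes that additive hash; the function names are hypothetical. An additive hash is easy to follow but collides often (any rearrangement of the same letters gets the same value), which is why practical versions weight each position.

```python
def additive_hash(s):
    # Assumed hash: the sum of the character codes of s.
    return sum(ord(ch) for ch in s)


def slide(prev_hash, outgoing, incoming):
    # Remove the character leaving the window, add the one entering it.
    return prev_hash - ord(outgoing) + ord(incoming)


h_abc = additive_hash("abc")    # 97 + 98 + 99 = 294
h_bcd = slide(h_abc, "a", "d")  # 294 - 97 + 100 = 297
print(h_abc, h_bcd, h_bcd == additive_hash("bcd"))  # 294 297 True
```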
Rabin-Karp String Matching - Algorithm
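The steps described earlier (hash the pattern, hash the first window, compare, slide, and verify on a hash match) can be put together as follows. This is a minimal sketch and not necessarily the exact formulation intended here: the polynomial hash, the base of 256, the modulus 101, and the name rabin_karp are assumptions chosen for illustration.

```python
def rabin_karp(text, pattern, base=256, modulus=101):
    """Return all shifts at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    # Weight of the leading character of a window: base^(m-1) mod modulus.
    high = pow(base, m - 1, modulus)
    p_hash = t_hash = 0
    # Hash the pattern and the first window with Horner's rule.
    for k in range(m):
        p_hash = (p_hash * base + ord(pattern[k])) % modulus
        t_hash = (t_hash * base + ord(text[k])) % modulus
    shifts = []
    for s in range(n - m + 1):
        # Equal hashes only suggest a match; compare characters to
        # rule out a collision.
        if p_hash == t_hash and text[s:s + m] == pattern:
            shifts.append(s)
        if s < n - m:
            # Slide the window: drop text[s], bring in text[s + m].
            t_hash = ((t_hash - ord(text[s]) * high) * base
                      + ord(text[s + m])) % modulus
    return shifts


print(rabin_karp("1011101110", "111"))  # [2, 6]
```

A small modulus such as 101 keeps the numbers easy to follow but produces more collisions; a larger prime reduces the number of character checks.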
Rabin-Karp String Matching - Example
Let the text be
Let us assign a numerical value (v), or weight, to each character used in the
problem. Here, we use only the first ten letters of the alphabet (i.e., A to J),
with A = 1, B = 2, ..., J = 10.
Let n be the length of the pattern and m be the length of the text (note that this
reverses the roles of n and m used in the earlier definitions).
Here, m = 10 and n = 3.
Let d be the number of characters in the input set.
Here, we have taken input set {A, B, C, ..., J}.
So, d = 10. You can assume any suitable value for d.
We choose a prime number (here, 13) as the modulus, small enough that all the
calculations can be performed with single-precision arithmetic.
Calculate the hash value for the text window of size n (the pattern length).
For the first window ABC, the hash value for the text (t)
= (Σ v_i * d^(n-i)) mod 13, for i = 1, ..., n
= ((1 * 10^2) + (2 * 10^1) + (3 * 10^0)) mod 13
= 123 mod 13
= 6
• Compare the hash value of the pattern with the hash value of the text window. If they match,
character matching is performed.
• For the pattern CDD, the hash value is p = ((3 * 10^2) + (4 * 10^1) + (4 * 10^0)) mod 13
= 344 mod 13 = 6.
• In the above example, the hash value of the first window (i.e., t) matches p, so go for
character matching between ABC and CDD.
• Since they do not match, go for the next window.
Spurious Hit
• When the hash value of the pattern matches the hash value of a window of the text but
the window is not the actual pattern, it is called a spurious hit.
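The first window of the example is exactly such a case. The sketch below assumes the weights A = 1, B = 2, ..., J = 10 implied by the calculation above, with d = 10 and modulus 13; the helper name example_hash is hypothetical.

```python
D, Q = 10, 13  # alphabet size and prime modulus used in the example


def example_hash(s):
    h = 0
    for ch in s:
        value = ord(ch) - ord("A") + 1  # assumed weights: A=1, ..., J=10
        h = (h * D + value) % Q         # Horner's rule, reduced mod Q
    return h


window, pattern = "ABC", "CDD"
same_hash = example_hash(window) == example_hash(pattern)
print(example_hash(window), example_hash(pattern))  # 6 6
print(same_hash and window != pattern)              # True: a spurious hit
```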
We calculate the hash value of the next window by subtracting the first term and adding the
next term, as shown below.
t = ((2 * 10^2) + (3 * 10^1) + (3 * 10^0)) mod 13
= 233 mod 13
= 12
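The same step can be done without recomputing the whole sum. The following sketch assumes the weights A = 1, ..., J = 10, d = 10 and modulus 13 from the example, and takes the second window to be BCC, as its weights 2, 3, 3 imply; the function name roll is hypothetical.

```python
D, Q, N = 10, 13, 3      # alphabet size, prime modulus, window length
HIGH = pow(D, N - 1, Q)  # weight of the leading character: 10^2 mod 13 = 9


def value(ch):
    return ord(ch) - ord("A") + 1  # assumed weights: A=1, ..., J=10


def roll(t, outgoing, incoming):
    # Drop the leading term, shift the rest up one place, append the new term.
    return ((t - value(outgoing) * HIGH) * D + value(incoming)) % Q


t_abc = 6                      # hash of the first window ABC
t_bcc = roll(t_abc, "A", "C")  # window slides from ABC to BCC
print(t_bcc)  # 12
```

Python's % operator returns a non-negative result for a positive modulus, so the intermediate negative value needs no extra correction here; in languages where % can return a negative number, add Q before reducing.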
Knuth-Morris-Pratt Algorithm
• The KMP algorithm was developed by Donald Knuth and Vaughan Pratt together, and
independently by James H. Morris, in 1970. In 1977, the three jointly
published the KMP algorithm.
• The KMP algorithm is used to find a "Pattern" in a "Text". It compares character by
character from left to right.
• Whenever a mismatch occurs, it uses a preprocessed table called the "Prefix Table" to skip
character comparisons while matching.
• The prefix table is sometimes also known as the LPS Table. Here LPS stands for "Longest proper
Prefix which is also Suffix".
Steps for Creating LPS Table (Prefix Table)
• Step 1 - Define a one-dimensional array with size equal to the length of the Pattern
(LPS[size]).
• Step 2 - Define variables i and j. Set i = 0, j = 1 and LPS[0] = 0.
• Step 3 - Compare the characters at Pattern[i] and Pattern[j].
• Step 4 - If they match, set LPS[j] = i+1 and increment both i and j by one.
Go to Step 3.
• Step 5 - If they do not match, check the value of i. If it is 0, set LPS[j] = 0
and increment j by one; if it is not 0, set i = LPS[i-1]. Go to Step 3.
• Step 6 - Repeat the above steps until all the values of LPS[] are filled.
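The steps above translate almost line for line into code. This is a minimal sketch that keeps the variable names i, j and LPS from the steps; the function name build_lps and the sample pattern are assumptions for illustration.

```python
def build_lps(pattern):
    """Build the LPS (prefix) table for pattern, following Steps 1-6."""
    lps = [0] * len(pattern)  # Step 1 (LPS[0] = 0 from Step 2 is implied)
    i, j = 0, 1               # Step 2
    while j < len(pattern):   # Step 6: repeat until the table is filled
        if pattern[i] == pattern[j]:  # Steps 3 and 4
            lps[j] = i + 1
            i += 1
            j += 1
        elif i == 0:                  # Step 5, case i == 0
            lps[j] = 0
            j += 1
        else:                         # Step 5, case i != 0
            i = lps[i - 1]
    return lps


print(build_lps("ABABD"))  # [0, 0, 1, 2, 0]
```

For "ABABD", the entry 2 at index 3 says that the prefix "ABAB" has "AB" as its longest proper prefix that is also a suffix.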
How to use LPS Table
• We use the LPS table to decide how many character comparisons can be skipped
when a mismatch occurs.
• When a mismatch occurs, check the LPS value of the pattern character just before the
mismatched character.
• If it is 0, restart from the first character of the pattern and compare it with the
mismatched character in the text. (If the mismatch is at the first character of the pattern
itself, move on to the next character in the text.)
• If it is not 0, compare the pattern character at the index equal to that LPS value with
the mismatched character in the text.
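A search loop that applies these rules might look like the sketch below. It is an illustration under stated assumptions rather than a definitive implementation: the function names kmp_search and build_lps are hypothetical, and the table builder is repeated so the block runs on its own.

```python
def build_lps(pattern):
    lps = [0] * len(pattern)
    i, j = 0, 1
    while j < len(pattern):
        if pattern[i] == pattern[j]:
            lps[j] = i + 1
            i += 1
            j += 1
        elif i == 0:
            j += 1          # lps[j] is already 0
        else:
            i = lps[i - 1]
    return lps


def kmp_search(text, pattern):
    """Return all shifts at which pattern occurs in text."""
    if not pattern:
        return []
    lps = build_lps(pattern)
    shifts = []
    i = j = 0  # i indexes the text, j indexes the pattern
    while i < len(text):
        if text[i] == pattern[j]:
            i += 1
            j += 1
            if j == len(pattern):
                shifts.append(i - j)  # full match ending at text[i - 1]
                j = lps[j - 1]        # keep going to find later matches
        elif j > 0:
            # Mismatch after some matches: resume at the LPS value of the
            # previous pattern character; the text index does not move back.
            j = lps[j - 1]
        else:
            # Mismatch at the first pattern character: advance in the text.
            i += 1
    return shifts


print(kmp_search("1011101110", "111"))  # [2, 6]
```

The text index i never decreases, which is what lets KMP avoid re-reading text characters after a mismatch.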