String Matching
String matching, also called pattern matching, is the task of finding occurrences of one string
(the "pattern") within another string (the "text"). The problem is central to many applications in
computer science, including text search engines, bioinformatics, and data processing.
Pros:
• Simple to implement.
• Works well for small texts or patterns.
Cons:
• Inefficient for larger texts, as it checks all possible positions.
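The trade-offs above match the naive (brute-force) approach that the KMP section refers to. A minimal sketch, assuming that is the method meant; the function name naive_search is illustrative and does not come from the original:

```python
def naive_search(text, pattern):
    """Return the start index of every occurrence of pattern in text."""
    n, m = len(text), len(pattern)
    matches = []
    # Try every possible alignment of the pattern against the text.
    for i in range(n - m + 1):
        # The slice comparison checks up to m characters at this alignment.
        if text[i:i + m] == pattern:
            matches.append(i)
    return matches


print(naive_search("ABABDABACDABABCABAB", "ABAB"))  # [0, 10, 15]
```

Each of the n - m + 1 alignments may cost up to m character comparisons, which is why this approach slows down on long texts.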
1.3 Knuth-Morris-Pratt (KMP) Algorithm
The KMP algorithm improves upon the naive approach by using information from previous
comparisons to avoid unnecessary re-checking of characters. The core idea is to preprocess the
pattern to create a partial match table (also known as the "prefix table"), which helps to skip over
portions of the text that have already been checked.
Key Idea: Never re-compare text characters that are already known to match. When a mismatch
occurs, the prefix table tells the algorithm how far to shift the pattern while keeping the longest
prefix that still matches.
Algorithm Steps:
1. Preprocess the pattern: Build a table that stores, for each position i, the length of the
longest proper prefix of pattern[0..i] that is also a suffix of pattern[0..i].
2. Search: Compare the pattern with the text from left to right. On a mismatch, if the current
matched length is greater than 0, use the prefix table to fall back to a shorter matched prefix
without moving backwards in the text; otherwise advance to the next text character.
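To make the preprocessing step concrete, the sketch below computes the prefix table for two small patterns. The helper name prefix_table and the sample patterns are illustrative choices, not part of the original text:

```python
def prefix_table(pattern):
    """lps[i] = length of the longest proper prefix of pattern[:i + 1]
    that is also a suffix of pattern[:i + 1]."""
    lps = [0] * len(pattern)
    length = 0  # length of the prefix matched so far
    for i in range(1, len(pattern)):
        # Fall back to shorter candidate prefixes until one can be extended.
        while length > 0 and pattern[i] != pattern[length]:
            length = lps[length - 1]
        if pattern[i] == pattern[length]:
            length += 1
        lps[i] = length
    return lps


print(prefix_table("ABABCABAB"))  # [0, 0, 1, 2, 0, 1, 2, 3, 4]
print(prefix_table("AAAA"))       # [0, 1, 2, 3]
```

For "ABABCABAB", the entry at index 3 is 2: after matching "ABAB" and then hitting a mismatch, the search can resume with the first two pattern characters "AB" already aligned against the last two matched text characters.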
Time Complexity: KMP runs in O(n+m) time, where n is the length of the text and m is the length
of the pattern, compared with O(nm) in the worst case for the naive approach.
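One way to see the difference is to count character comparisons on an input that is hard for the naive scan. The sketch below is an illustration only: the counting helpers count_naive and count_kmp and the test input are assumptions introduced here, and a non-empty pattern is assumed.

```python
def count_naive(text, pattern):
    """Character comparisons made by a brute-force scan."""
    n, m = len(text), len(pattern)
    comparisons = 0
    for i in range(n - m + 1):
        for j in range(m):
            comparisons += 1
            if text[i + j] != pattern[j]:
                break
    return comparisons


def count_kmp(text, pattern):
    """Character comparisons made by KMP (table construction plus search)."""
    n, m = len(text), len(pattern)
    comparisons = 0
    # Build the prefix table, counting each pattern-vs-pattern comparison.
    lps = [0] * m
    length = 0
    for i in range(1, m):
        while True:
            comparisons += 1
            if pattern[i] == pattern[length]:
                length += 1
                break
            if length == 0:
                break
            length = lps[length - 1]
        lps[i] = length
    # Scan the text; the text index i never moves backwards.
    j = 0
    for i in range(n):
        while True:
            comparisons += 1
            if text[i] == pattern[j]:
                j += 1
                break
            if j == 0:
                break
            j = lps[j - 1]
        if j == m:
            j = lps[j - 1]  # keep searching for further matches
    return comparisons


text = "A" * 1000
pattern = "A" * 9 + "B"
print(count_naive(text, pattern), count_kmp(text, pattern))
```

On this input the naive scan performs 10 comparisons at each of its 991 alignments, while the KMP count stays below 2(n+m).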
Python Example:
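def KMP_search(text, pattern):
    def compute_prefix_table(pattern):
        # lps[i]: length of the longest proper prefix of pattern[:i + 1]
        # that is also a suffix of it.
        m = len(pattern)
        lps = [0] * m
        length = 0
        i = 1
        while i < m:
            if pattern[i] == pattern[length]:
                length += 1
                lps[i] = length
                i += 1
            else:
                if length != 0:
                    # Fall back to the next shorter candidate prefix.
                    length = lps[length - 1]
                else:
                    lps[i] = 0
                    i += 1
        return lps

    n = len(text)
    m = len(pattern)
    if m == 0:
        return  # nothing to search for
    lps = compute_prefix_table(pattern)
    i = 0  # index into text
    j = 0  # index into pattern
    while i < n:
        if pattern[j] == text[i]:
            i += 1
            j += 1
        if j == m:
            print(f"Pattern found at index {i - j}")
            # Continue from the longest prefix that is also a suffix.
            j = lps[j - 1]
        elif i < n and pattern[j] != text[i]:
            if j != 0:
                j = lps[j - 1]
            else:
                i += 1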
Pros:
• Runs in O(n+m) time in the worst case, versus O(nm) for the naive approach.
• Avoids redundant comparisons.
Cons:
• Requires preprocessing, which takes O(m) time.
• Slightly more complex to implement.
i = 0
while i <= n - m:
    j = m - 1
    while j >= 0 and pattern[j] == text[i + j]:
        j -= 1
    if j < 0:
        print(f"Pattern found at index {i}")
        i += (m - bad_char.get(text[i + m], -1) if i + m < n else 1)
    else:
        i += max(1, j - bad_char.get(text[i + j], -1))
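The fragment above begins mid-function, so the definitions of n, m, and bad_char are not shown. Its right-to-left comparison and shift rule look like the bad-character heuristic of the Boyer-Moore algorithm. A self-contained sketch under that assumption follows; the function name, the construction of bad_char as a last-occurrence table, and the choice to return a list instead of printing are all assumptions, not taken from the original:

```python
def bad_character_search(text, pattern):
    """Return all match positions using the bad-character rule only."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    # Assumed table: last index at which each character occurs in the pattern.
    bad_char = {ch: idx for idx, ch in enumerate(pattern)}
    matches = []
    i = 0  # current alignment of the pattern in the text
    while i <= n - m:
        j = m - 1
        # Compare the pattern against the text from right to left.
        while j >= 0 and pattern[j] == text[i + j]:
            j -= 1
        if j < 0:
            matches.append(i)
            # Align the character just past the window with its last
            # occurrence in the pattern, or step by one at the end of the text.
            i += (m - bad_char.get(text[i + m], -1)) if i + m < n else 1
        else:
            # Line the mismatched text character up with its last occurrence
            # in the pattern, always moving at least one position.
            i += max(1, j - bad_char.get(text[i + j], -1))
    return matches


print(bad_character_search("ABAAABCDABC", "ABC"))  # [4, 8]
```

In this run the first two mismatches each shift the pattern by two positions, and the match at index 4 is followed by a shift of four because "D" does not occur in the pattern.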
Pros:
• Very efficient in practice, especially for long patterns, because a mismatch can shift the pattern past several text characters at once.
• Preprocessing takes time linear in the pattern length.
Cons:
• More complicated than other algorithms.
• Can have poor worst-case performance, O(nm) when only the bad-character rule is used, though such inputs are rare in practice.
m = len(pattern)
n = len(text)
pattern_hash = 0
text_hash = 0
h = 1
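The initialisation above stops before any hashing takes place. The variable names (pattern_hash, text_hash, h) suggest a rolling-hash search in the style of Rabin-Karp. A minimal sketch under that assumption follows; the function name, the base and modulus parameters and their default values, and the list return value are illustrative choices that do not come from the original:

```python
def rabin_karp_search(text, pattern, base=256, modulus=1_000_000_007):
    """Return all match positions using a polynomial rolling hash."""
    m = len(pattern)
    n = len(text)
    if m == 0 or m > n:
        return []
    pattern_hash = 0
    text_hash = 0
    # h = base^(m - 1) mod modulus: the weight of the window's first character.
    h = pow(base, m - 1, modulus)
    # Hash the pattern and the first window of the text.
    for k in range(m):
        pattern_hash = (pattern_hash * base + ord(pattern[k])) % modulus
        text_hash = (text_hash * base + ord(text[k])) % modulus
    matches = []
    for i in range(n - m + 1):
        # Equal hashes may be a collision, so confirm with a direct comparison.
        if pattern_hash == text_hash and text[i:i + m] == pattern:
            matches.append(i)
        if i < n - m:
            # Roll the window: drop text[i], then append text[i + m].
            text_hash = ((text_hash - ord(text[i]) * h) * base
                         + ord(text[i + m])) % modulus
    return matches


print(rabin_karp_search("ABABDABACDABABCABAB", "ABAB"))  # [0, 10, 15]
```

The direct comparison after a hash match is what keeps the result correct when two different windows happen to share a hash value.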
Pros:
• Extends naturally to searching for several patterns at once.
• A rolling hash lets each text window be checked against the pattern in constant time on average.
Cons:
• Hash collisions force extra character-by-character comparisons; with many collisions the worst case degrades to O(nm).
• Requires a careful choice of hash base and modulus to keep collisions rare.