
String Matching Algorithms

String matching or pattern matching is the task of finding a substring (also known as a "pattern")
within another string (the "text"). The problem is central to various applications in computer
science, including text search engines, bioinformatics, and data processing.

1.1 Introduction to String Matching


String matching involves identifying whether a given pattern exists within a larger string and, if it
does, determining the position of the match. The problem of searching for a pattern in a text can be
generalized as: given a string T of length n and a pattern P of length m, find all occurrences of P in
T.
Applications:
• Search Engines: Search for keywords in documents.
• Bioinformatics: Match DNA sequences or protein patterns.
• Text Processing: Tasks such as text indexing, spell checkers, and plagiarism detection.
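Python's built-in string methods already solve the basic form of this problem; the sketch below (the helper name find_all is illustrative, not from any library) collects every occurrence of a pattern P in a text T, including overlapping ones:

```python
def find_all(text, pattern):
    """Return the starting indices of every occurrence of pattern in text."""
    positions = []
    start = text.find(pattern)  # -1 when no (further) occurrence exists
    while start != -1:
        positions.append(start)
        # Resume one character past the last hit, so overlaps are found too
        start = text.find(pattern, start + 1)
    return positions

print(find_all("abracadabra", "abra"))  # [0, 7]
```

The algorithms below do the same job by hand, trading this simplicity for control over how many character comparisons are performed.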

1.2 Naive String Matching Algorithm


The naive approach is the most straightforward but least efficient for string matching. It involves
checking all possible positions in the text where the pattern can fit, and at each position, checking if
the substring of the text matches the pattern.
Algorithm Steps:
1. Start from the first character of the text.
2. Compare the substring of the text starting at that position with the pattern.
3. If a match is found, record the position.
4. Move one character forward and repeat until the end of the text.
Time Complexity: The worst-case time complexity is O(n×m), where n is the length of the text and
m is the length of the pattern.
Python Example:

def naive_search(text, pattern):
    n = len(text)
    m = len(pattern)
    # Try every alignment of the pattern against the text
    for i in range(n - m + 1):
        if text[i:i+m] == pattern:
            print(f"Pattern found at index {i}")

Pros:
• Simple to implement.
• Works well for small texts or patterns.
Cons:
• Inefficient for larger texts, as it checks all possible positions.
1.3 Knuth-Morris-Pratt (KMP) Algorithm
The KMP algorithm improves upon the naive approach by using information from previous
comparisons to avoid unnecessary re-checking of characters. The core idea is to preprocess the
pattern to create a partial match table (also known as the "prefix table"), which helps to skip over
portions of the text that have already been checked.
Key Idea: The idea is to avoid rechecking characters that are known to match. When a mismatch
occurs, the algorithm uses the prefix table to shift the pattern appropriately.
Algorithm Steps:
1. Preprocess the pattern: Create a table that stores the length of the longest proper prefix of
the pattern that is also a suffix.
2. Search: Compare the pattern with the text. If a mismatch occurs and the current matched
length is greater than 0, shift the pattern according to the prefix table, otherwise move the
pattern one step forward.
Time Complexity: The time complexity of the KMP algorithm is O(n+m), which is more efficient
than the naive approach.
Python Example:

def KMP_search(text, pattern):
    def compute_prefix_table(pattern):
        m = len(pattern)
        lps = [0] * m
        length = 0
        i = 1
        while i < m:
            if pattern[i] == pattern[length]:
                length += 1
                lps[i] = length
                i += 1
            else:
                if length != 0:
                    length = lps[length - 1]
                else:
                    lps[i] = 0
                    i += 1
        return lps

    n = len(text)
    m = len(pattern)
    lps = compute_prefix_table(pattern)
    i = 0  # index into text
    j = 0  # index into pattern
    while i < n:
        if pattern[j] == text[i]:
            i += 1
            j += 1
        if j == m:
            print(f"Pattern found at index {i - j}")
            j = lps[j - 1]
        elif i < n and pattern[j] != text[i]:
            if j != 0:
                j = lps[j - 1]
            else:
                i += 1

Pros:
• Much faster than the naive approach.
• Avoids redundant comparisons.
Cons:
• Requires preprocessing, which takes O(m) time.
• Slightly more complex to implement.
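To make the prefix table concrete, the standalone sketch below recomputes it for a small pattern (the logic mirrors the preprocessing step of the KMP example above):

```python
def compute_prefix_table(pattern):
    """lps[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it."""
    lps = [0] * len(pattern)
    length = 0
    i = 1
    while i < len(pattern):
        if pattern[i] == pattern[length]:
            length += 1
            lps[i] = length
            i += 1
        elif length != 0:
            length = lps[length - 1]  # fall back; do not advance i
        else:
            i += 1  # lps[i] stays 0
    return lps

print(compute_prefix_table("ababaca"))  # [0, 0, 1, 2, 3, 0, 1]
```

For "ababaca", the 3 at index 4 records that "aba" is both a prefix and a suffix of "ababa", so after a mismatch there the search can resume with three characters already matched.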

1.4 Boyer-Moore Algorithm


The Boyer-Moore algorithm is one of the most efficient string matching algorithms, especially for
large texts. It improves searching by utilizing two heuristics:
1. Bad Character Heuristic: If a mismatch occurs, the algorithm shifts the pattern so that the
mismatched character aligns with its rightmost occurrence in the pattern.
2. Good Suffix Heuristic: If a mismatch occurs, the algorithm shifts the pattern based on
previously matched suffixes.
Key Idea: Within each alignment, Boyer-Moore compares the pattern against the text from right to
left, unlike the other algorithms here, which compare from left to right. The idea is to skip as many
characters as possible when a mismatch occurs by using the two heuristics above.
Time Complexity: The best-case time complexity is O(n/m), where n is the length of the text and
m is the length of the pattern, and in practice the search is often sublinear; the worst case for the
basic algorithm, however, is O(n×m).
Python Example:

def Boyer_Moore(text, pattern):
    m = len(pattern)
    n = len(text)

    # Preprocessing for the bad character rule:
    # rightmost index of each character in the pattern
    bad_char = {}
    for i in range(m):
        bad_char[pattern[i]] = i

    i = 0
    while i <= n - m:
        j = m - 1
        # Compare the pattern against the text from right to left
        while j >= 0 and pattern[j] == text[i + j]:
            j -= 1
        if j < 0:
            print(f"Pattern found at index {i}")
            i += (m - bad_char.get(text[i + m], -1)) if i + m < n else 1
        else:
            i += max(1, j - bad_char.get(text[i + j], -1))

Pros:
• Very efficient in practice, especially for large patterns.
• Preprocessing takes linear time, and the search is very fast.
Cons:
• More complicated than other algorithms.
• Can have poor worst-case performance, though rare.
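The bad character table can be inspected on its own; the sketch below (the helper name and the worked mismatch are illustrative) builds the rightmost-occurrence table for a small pattern and shows the shift it implies:

```python
def bad_char_table(pattern):
    """Map each character to the index of its rightmost occurrence in pattern."""
    return {ch: i for i, ch in enumerate(pattern)}

table = bad_char_table("abcab")
print(table)  # {'a': 3, 'b': 4, 'c': 2}

# If a mismatch happens at pattern index j against text character c,
# the pattern may be shifted right by max(1, j - rightmost_index_of_c);
# characters absent from the pattern default to -1, giving the largest shift.
j, c = 4, 'c'
shift = max(1, j - table.get(c, -1))
print(shift)  # 2
```

Aligning the rightmost 'c' of the pattern under the mismatched text character is exactly what the `bad_char.get(...)` expressions in the search loop above compute.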

1.5 Rabin-Karp Algorithm


The Rabin-Karp algorithm uses hashing to quickly identify possible matches. It computes the hash
of the pattern and the hash of every substring of the text with the same length as the pattern. If a
hash matches, a direct comparison is done to verify the match.
Key Idea: By comparing hash values, Rabin-Karp avoids direct string comparison in many cases,
which can lead to faster results. However, there can be hash collisions, so further verification is
necessary when a hash match occurs.
Time Complexity: The average time complexity is O(n+m), but in the worst case, it can degrade to
O(n×m) due to hash collisions.
Python Example:

def Rabin_Karp(text, pattern):
    d = 256  # number of characters in the input alphabet
    q = 101  # prime number to mod hash values

    m = len(pattern)
    n = len(text)
    pattern_hash = 0
    text_hash = 0
    h = 1

    # Calculate the value of d^(m-1) % q
    for i in range(m - 1):
        h = (h * d) % q

    # Calculate initial hash values for the pattern and text
    for i in range(m):
        pattern_hash = (d * pattern_hash + ord(pattern[i])) % q
        text_hash = (d * text_hash + ord(text[i])) % q

    # Slide the window over the text and search for the pattern
    for i in range(n - m + 1):
        if pattern_hash == text_hash:
            # Hashes match; verify directly to rule out a collision
            if text[i:i+m] == pattern:
                print(f"Pattern found at index {i}")
        if i < n - m:
            # Roll the hash: drop text[i], bring in text[i+m]
            text_hash = (d * (text_hash - ord(text[i]) * h) + ord(text[i + m])) % q
            if text_hash < 0:  # never negative in Python; kept for clarity
                text_hash += q

Pros:
• Naturally extends to searching for multiple patterns at once.
• Hashing speeds up the search.
Cons:
• Can suffer from hash collisions, leading to additional comparisons.
• Requires careful handling of hashing and collisions.
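The multiple-pattern extension mentioned above can be sketched by hashing every pattern into a set and sliding a single rolling hash over the text; this simplified illustration (function name is illustrative) assumes all patterns share the same length:

```python
def multi_rabin_karp(text, patterns):
    """Find occurrences of several equal-length patterns in one pass."""
    d, q = 256, 101
    m = len(patterns[0])  # assumed length of every pattern
    n = len(text)
    if n < m:
        return []

    def h(s):
        value = 0
        for ch in s:
            value = (d * value + ord(ch)) % q
        return value

    wanted = {h(p) for p in patterns}  # hash set of all patterns
    pattern_set = set(patterns)        # for collision verification
    high = pow(d, m - 1, q)            # d^(m-1) % q for the rolling update

    matches = []
    text_hash = h(text[:m])
    for i in range(n - m + 1):
        # One set lookup replaces one hash comparison per pattern
        if text_hash in wanted and text[i:i+m] in pattern_set:
            matches.append((i, text[i:i+m]))
        if i < n - m:
            text_hash = (d * (text_hash - ord(text[i]) * high) + ord(text[i + m])) % q
    return matches

print(multi_rabin_karp("abcabd", ["abc", "abd"]))  # [(0, 'abc'), (3, 'abd')]
```

Because the per-window work stays constant regardless of how many patterns are in the set, this is the idea behind using Rabin-Karp for tasks such as plagiarism detection.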
