
Introduction to Searching and Pattern Matching in Strings

Searching and pattern matching are fundamental operations in computer
science and form the backbone of many applications in text and data
processing. These operations are concerned with finding specific
substrings (also called patterns) within a larger string (also referred to as
the text) and determining whether these substrings exist and where they
occur. The need for efficient searching and pattern matching algorithms is
critical in real-world applications, especially when dealing with large
datasets, complex documents, or high-speed systems.
The main algorithms for searching for a substring within a string are:
1. Linear Search (Naive Search)
2. Knuth-Morris-Pratt (KMP) Algorithm
3. Boyer-Moore Algorithm
4. Rabin-Karp Algorithm
5. Z Algorithm
1. String Searching: String searching is the process of locating a
specific substring or pattern within a larger string (referred to as the "text").
In simpler terms, it's the task of finding whether a particular string exists in
another string and, if so, where it appears.
Examples of string searching:
• Searching for a specific word in a text document (e.g., searching for
"computer" in an article).
• Searching for a DNA sequence in bioinformatics (e.g., identifying a
gene pattern in a long strand of DNA).
2. Pattern Matching: Pattern matching is an extension of string
searching that not only finds the first occurrence of a pattern but also
identifies all occurrences of that pattern in the text. It can involve simple
patterns (like specific words) or complex patterns that use wildcard
characters, regular expressions, or other matching criteria.
A pattern can be:
• A single character: e.g., finding all occurrences of the character 'a'.
• A fixed string: e.g., finding every instance of the word "apple" in a
document.
• A more complex pattern: e.g., using regular expressions to match
multiple variations (e.g., all email addresses in a document, or any
sequence of digits).
3. Applications of String Searching and Pattern Matching
String searching and pattern matching algorithms are crucial in many
fields. Here are some common applications:
a) Text Processing: Searching for words, phrases, or keywords within
documents or databases. For example, in word processors, searching for
a term and replacing it is a common operation.
b) Search Engines: When a user queries a search engine (like Google),
the engine performs pattern matching to index and retrieve relevant
results from a large corpus of web pages.
c) Data Validation: Pattern matching is widely used for validating and
extracting specific formats, such as email addresses, phone numbers,
dates, or URLs using regular expressions (regex); a short example follows this list.
d) Bioinformatics: In bioinformatics, pattern matching is essential for
locating specific sequences of DNA, RNA, or protein structures. For
instance, matching patterns to find specific genes or mutations within a
genome.
e) Compiler Construction: During the process of lexical analysis in
compilers, pattern matching helps identify programming language tokens
(keywords, identifiers, operators) by matching patterns in the source code.
f) Spell Checking: Spell checkers compare words in a document against
a dictionary. They use pattern matching algorithms to identify and suggest
corrections for misspelled words.
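As a concrete illustration of the data-validation case in c), here is a short Python sketch using the standard re module (the pattern shown is deliberately simplified for illustration and is not a complete email validator):

import re

# A deliberately simplified email pattern, for illustration only.
email_pattern = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

text = "Contact support@example.com or sales@example.org for help."
print(email_pattern.findall(text))
# ['support@example.com', 'sales@example.org']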
4. Challenges in String Searching
Efficient searching becomes critical when dealing with large texts, such
as multi-gigabyte files, databases, or large datasets. For example,
searching through huge documents, bioinformatics sequences, or web
pages could be slow using basic search algorithms, making optimization
necessary.
Challenges include:
1. Performance and Efficiency: Searching large datasets requires
highly efficient algorithms to avoid slow performance.
o For large datasets or documents, using algorithms that
perform suboptimal comparisons can be computationally
expensive.
o Efficient string matching algorithms minimize unnecessary
character comparisons, thus speeding up the search.
2. Multiple Occurrences of a Pattern: When multiple occurrences of
a pattern exist, finding all of them quickly can be tricky. Some
algorithms efficiently find all matches, while others may stop at the
first one.
3. Pattern Complexity: Patterns may vary in complexity from simple
fixed substrings to more complex expressions using wildcards,
regular expressions, or custom symbols. These patterns require
algorithms that can handle both simple and complex patterns.
4. Memory Usage: Some algorithms use a lot of memory to store
auxiliary data structures (such as tables or preprocessed arrays),
which can be a limiting factor when dealing with large texts.
5. Approximate Matching: In some applications, the text might
contain errors (such as typos, mutations, or missing data).
Approximate matching or fuzzy matching algorithms are needed to
handle such cases.
5. Features of String Searching and Pattern Matching Algorithms
The following features define the efficiency and application of different
string searching and pattern matching algorithms:
a) Time Complexity: The time complexity of an algorithm defines how
efficiently it searches through a text for a pattern. For example:
o Naive search can have a time complexity of O(n × m), where n is the length of the text and m is the length of the pattern.
o Advanced algorithms like KMP and Boyer-Moore have better time complexities, with O(n + m) for KMP and O(n/m) in the best case for Boyer-Moore, offering much faster searches in practical use cases.
b) Space Complexity: Some algorithms require additional space to store
preprocessed data, such as arrays or tables. For example, the KMP
algorithm requires space for the LPS (Longest Prefix Suffix) array. These
extra memory requirements can be a consideration for large-scale
applications.
c) Handling of Multiple Patterns: Some algorithms, such as Aho-
Corasick, are designed to search for multiple patterns simultaneously,
making them ideal for use cases like searching for many keywords in a
text or detecting multiple virus signatures in a DNA sequence.
d) Flexibility of Matching: Regular expressions (regex) provide high
flexibility and power for pattern matching. They can match simple
substrings or complex patterns with wildcards, character classes, and
quantifiers.
e) Approximate Matching: Fuzzy matching algorithms (like
Levenshtein distance) allow for finding patterns that may have errors
(e.g., spelling mistakes, mutations). This is useful in applications like spell
checkers and DNA sequence comparison.
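As a brief illustration of approximate matching, here is a minimal Python sketch of the Levenshtein (edit) distance computed with dynamic programming (illustrative only, not tied to any particular spell checker):

def levenshtein(a, b):
    """Edit distance between a and b: the minimum number of single-character
    insertions, deletions, and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))      # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]                      # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or keep) ca
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3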

1. Naive Pattern Searching Algorithm (Linear Search)


Definition:
The Naive Pattern Searching algorithm is the simplest and most
straightforward method for finding a substring (pattern) within a larger
string (text). The algorithm works by checking all possible positions in the
text where the pattern could match and compares each character of the
pattern to the corresponding character in the text.
In other words, the algorithm tries to match the pattern starting at every
index in the text and checks if the characters of the pattern match the
corresponding characters in the text. If a complete match is found, it
returns the index where the pattern occurs; otherwise, it continues
searching.
Explanation:
The Naive Pattern Searching algorithm proceeds as follows:
1. Start at the first position of the text:
o Compare the first character of the pattern with the first
character of the text.
o Continue comparing subsequent characters of the pattern
with the corresponding characters in the text.
o If all characters of the pattern match consecutively, a match is
found.
2. Move one position forward in the text:
o After trying a pattern starting at the first index, move one
position forward in the text.
o Compare the pattern again starting from this new position in
the text.
3. Repeat the process for all positions in the text where the pattern
could potentially fit.
4. Return the index(es) where a complete match is found.
5. If no match is found after checking all possible positions, the search
is considered unsuccessful.
Example:
Let’s consider a text T = "ABABDABACDABABCABAB" and a pattern P
= "ABAB".
Steps for Naive Algorithm:
1. Start at the first position of the text and compare the first character
of the pattern ("A") with the first character of the text ("A").
2. If the first characters match, compare the second character of the
pattern ("B") with the second character of the text ("B").
3. Continue comparing the pattern to the text until a mismatch is found
or the entire pattern matches.
4. If a mismatch is found, shift the pattern one position to the right in the text and repeat the comparison from the new starting index.
5. If the entire pattern matches, report the position, then shift by one position and continue checking for further occurrences.
In this example, the matches happen at:
• Position 0: "ABAB" matches "ABAB".
• Position 10: "ABAB" matches "ABAB".
• Position 15: "ABAB" matches "ABAB".
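A minimal Python sketch of this procedure (the function name naive_search is an illustrative assumption) is:

def naive_search(text, pattern):
    """Return the starting indices of every occurrence of pattern in text."""
    n, m = len(text), len(pattern)
    matches = []
    for i in range(n - m + 1):          # try every possible alignment
        j = 0
        while j < m and text[i + j] == pattern[j]:
            j += 1                      # compare character by character
        if j == m:                      # the whole pattern matched at index i
            matches.append(i)
    return matches

print(naive_search("ABABDABACDABABCABAB", "ABAB"))  # [0, 10, 15]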
Time Complexity:
The time complexity of the Naive Pattern Searching Algorithm is:
• Best case: O(n), where n is the length of the text. This occurs when a mismatch is detected at the first character of the pattern at almost every alignment.
• Worst case: O(n × m), where n is the length of the text and m is the length of the pattern. This occurs when most of the pattern is compared at nearly every alignment before a mismatch (for example, text "AAAAAAAA" and pattern "AAAB").
Space Complexity:
The space complexity is O(1) because the algorithm uses only a constant amount of extra space regardless of the size of the text or the pattern.
Advantages:
• Simplicity: The Naive Pattern Searching algorithm is easy to
implement and understand.
• No preprocessing: It does not require any preprocessing of the text
or pattern (unlike other algorithms like KMP or Boyer-Moore).
Disadvantages:
• Inefficiency: The algorithm can be slow when the text is long and
there are many mismatches because it checks every possible
position in the text, leading to redundant comparisons.
• No skipping: The Naive algorithm does not take advantage of any
previous information when a mismatch occurs, unlike more
advanced algorithms that use heuristics to skip sections of the text
(e.g., Boyer-Moore or KMP).
2. Knuth-Morris-Pratt (KMP) Algorithm
Definition:
The Knuth-Morris-Pratt (KMP) algorithm is an efficient string-matching
algorithm that improves upon the naive approach by avoiding redundant
comparisons. It preprocesses the pattern to create an array (known as the
Longest Prefix Suffix (LPS) array) that allows the algorithm to skip certain
positions in the text, thereby reducing unnecessary comparisons and
achieving better performance.
Explanation:
The KMP algorithm works by searching for a substring in a larger string
(text) in two phases:
1. Preprocessing Phase:
o The algorithm first preprocesses the pattern to construct the
LPS (Longest Prefix Suffix) array. This array contains
information about the longest prefix of the pattern that is also
a suffix for every prefix of the pattern.
o The LPS array helps to determine how many characters can
be skipped when a mismatch occurs during the search.
2. Search Phase:
o The algorithm then performs the search on the text. When a
mismatch occurs between the text and the pattern, instead of
shifting the pattern by just one character (like in the Naive
algorithm), it uses the information from the LPS array to shift
the pattern by more than one character, based on previously
matched characters.
Steps:
1. Preprocess the Pattern:
o Generate the LPS array for the given pattern, which will store
the length of the longest prefix that is also a suffix for each
prefix of the pattern.
2. Search the Text:
o Use the LPS array to skip over portions of the text where a
mismatch occurred, avoiding redundant comparisons.
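For the pattern P = "ABAB", the LPS array is [0, 0, 1, 2]. A minimal Python sketch of both phases (the function names compute_lps and kmp_search are illustrative assumptions) is:

def compute_lps(pattern):
    """Build the LPS array: lps[i] = length of the longest proper prefix of
    pattern[:i+1] that is also a suffix of it."""
    lps = [0] * len(pattern)
    length = 0                          # length of the previous longest prefix-suffix
    i = 1
    while i < len(pattern):
        if pattern[i] == pattern[length]:
            length += 1
            lps[i] = length
            i += 1
        elif length != 0:
            length = lps[length - 1]    # fall back without advancing i
        else:
            lps[i] = 0
            i += 1
    return lps

def kmp_search(text, pattern):
    """Return the starting indices of all occurrences of pattern in text."""
    lps = compute_lps(pattern)
    matches = []
    i = j = 0                           # i indexes the text, j indexes the pattern
    while i < len(text):
        if text[i] == pattern[j]:
            i += 1
            j += 1
            if j == len(pattern):       # full match found
                matches.append(i - j)
                j = lps[j - 1]          # keep searching for further matches
        elif j != 0:
            j = lps[j - 1]              # reuse previously matched characters
        else:
            i += 1
    return matches

print(kmp_search("ABABDABACDABABCABAB", "ABAB"))  # [0, 10, 15]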
Time Complexity:
• Preprocessing Phase: O(m), where m is the length of the pattern.
• Search Phase: O(n), where n is the length of the text.
• Total Time Complexity: O(n + m).
Space Complexity:
• The space complexity is O(m) because the algorithm requires extra space for the LPS array.
Advantages:
• Efficient: KMP is faster than the Naive algorithm, especially for large
texts.
• Avoids Redundant Comparisons: The use of the LPS array
ensures that we skip over positions in the text that have already been
compared.
• Linear Time Complexity: It achieves linear time complexity, making
it suitable for large-scale searches.
Disadvantages:
• Preprocessing Overhead: The need to preprocess the pattern may
be considered a disadvantage if the pattern is short or if pattern
matching needs to be done infrequently.
• Complex Implementation: KMP is more complex to implement
compared to simpler algorithms like Naive Pattern Searching.
3. Boyer-Moore Algorithm
Definition:
The Boyer-Moore algorithm is one of the most efficient algorithms for
substring searching. It improves upon the Naive Pattern Searching
algorithm by skipping over portions of the text that cannot possibly match
the pattern. It does this by using heuristics that provide guidance on how
far the pattern can be shifted when mismatches occur. The two main
heuristics used in the Boyer-Moore algorithm are:
1. Bad Character Rule
2. Good Suffix Rule
Explanation:
1. Bad Character Rule:
o When a mismatch occurs between the pattern and the text,
the Bad Character Rule tells us how far we can shift the pattern
to the right.
o It uses the mismatched character from the text and the position of its last occurrence in the pattern to determine how far to the right the pattern can be shifted.
2. Good Suffix Rule:
o If a mismatch happens after some characters of the pattern
have matched the text, the Good Suffix Rule tries to use
information from the matched part of the pattern to find a
better position for the pattern shift.
The Boyer-Moore algorithm can skip portions of the text entirely, making
it much faster than the naive search algorithm in practice.
Steps:
1. Preprocessing:
o Construct two tables (or arrays): one for the Bad Character
Heuristic and another for the Good Suffix Heuristic.
2. Search Phase:
o Start aligning the pattern with the beginning of the text.
o Compare the pattern with the text from right to left.
o If a mismatch occurs, apply the Bad Character Rule or Good
Suffix Rule to shift the pattern.
o Repeat the process until the pattern is found or the text has
been fully searched.
Time Complexity:
• Worst-case time complexity: O(n × m), where n is the length of the text and m is the length of the pattern (in the case of a very poor choice of characters).
• Best-case time complexity: O(n/m) (e.g., when the pattern does not appear in the text and mismatches allow the pattern to shift by its full length each time).
• Average-case time complexity: O(n/m) (due to efficient skipping and shifting).
Space Complexity:
• The space complexity is O(m + k), where m is the length of the pattern and k is the size of the alphabet (for the Bad Character table).
Advantages:
• Efficient Skipping: The Boyer-Moore algorithm skips over parts of
the text that do not need to be checked, which often results in much
faster search times compared to simpler algorithms like Naive or
KMP.
• Good Performance on Large Texts: It performs well on large texts
because it often skips large sections of the text.
• Flexible: The algorithm works efficiently for large alphabets,
especially when used in practical applications.
Disadvantages:
• Preprocessing Overhead: The preprocessing step can take time,
especially for large patterns, although it is done only once.
• Complex Implementation: The algorithm is more complex to
implement than simpler algorithms like Naive Search or KMP.
• Worst-case Scenario: The worst-case time complexity is still O(n × m), which can happen with specific patterns and text (though this is rare in practice).
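A minimal Python sketch matching the description below, implementing only the Bad Character Rule (the Good Suffix table is omitted here for brevity; the helper name bad_character_table is an illustrative assumption), is:

def bad_character_table(pattern):
    """Map each character in the pattern to the index of its last occurrence."""
    return {ch: i for i, ch in enumerate(pattern)}

def boyer_moore_search(text, pattern):
    """Return the starting indices of all occurrences of pattern in text,
    shifting only according to the Bad Character Rule."""
    last = bad_character_table(pattern)
    n, m = len(text), len(pattern)
    matches = []
    s = 0                               # current shift of the pattern over the text
    while s <= n - m:
        j = m - 1
        # Compare the pattern with the text from right to left.
        while j >= 0 and pattern[j] == text[s + j]:
            j -= 1
        if j < 0:
            matches.append(s)           # full match at shift s
            s += 1                      # simple shift; a full version would also use the Good Suffix Rule
        else:
            # Align the mismatched text character with its last occurrence in the
            # pattern (or shift past it if it does not occur in the pattern).
            s += max(1, j - last.get(text[s + j], -1))
    return matches

print(boyer_moore_search("ABABDABACDABABCABAB", "ABAB"))  # [0, 10, 15]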
Explanation of the Code:
• The Bad Character Heuristic function creates a dictionary where
each character in the pattern points to its last occurrence index in
the pattern.
• During the Boyer-Moore Search, we start comparing the pattern
with the text from the rightmost character. If a mismatch occurs, the
pattern is shifted based on the Bad Character Rule, where the shift
amount depends on the last occurrence of the mismatched
character.
4. Rabin-Karp Algorithm
Definition:
The Rabin-Karp Algorithm is a string-searching algorithm that uses a
hashing technique to efficiently search for a pattern within a text. It works
by calculating the hash values of substrings of the text and comparing
them with the hash value of the pattern. If the hash values match, a detailed
character-by-character comparison is done to verify the match.
The Rabin-Karp Algorithm is efficient when searching for multiple
patterns in a single text or when there are many possible matches, as it
avoids re-comparing the entire text for every pattern.
Explanation:
1. Hashing the Pattern and Substrings:
o The first step is to compute a hash value for the pattern.
o The algorithm then computes hash values for all substrings of
the text that are of the same length as the pattern.
2. Comparing Hash Values:
o The algorithm compares the hash value of the current
substring in the text with the hash value of the pattern.
o If the hash values match, a character-by-character
comparison is done to verify the match (to account for hash
collisions).
3. Efficient Hash Calculation:
o The Rabin-Karp algorithm uses rolling hash for efficient
computation of hash values for substrings as we slide through
the text.
Steps:
1. Preprocessing:
o Compute the hash value for the pattern.
o Compute the hash values for all substrings of the same length
as the pattern in the text.
2. Search Phase:
o For each substring, compare its hash value with the hash value
of the pattern.
o If the hash values match, perform a character-by-character
comparison to confirm if the substring and the pattern are
identical.
Time Complexity:
• Worst-case time complexity: O(n × m), where n is the length of the text and m is the length of the pattern. This happens when hash collisions are frequent and the algorithm has to fall back to character-by-character comparisons at almost every position.
• Average-case time complexity: O(n + m), if hash collisions are rare
and we quickly identify matches with hash values.
• Space Complexity: O(1) for the Rabin-Karp algorithm (excluding
the space for the text and pattern).
Advantages:
• Efficient for Multiple Pattern Search: It’s particularly effective
when you need to search for multiple patterns at once because you
can compute the hashes of the patterns in advance.
• Flexible: Works well when there are multiple potential matches.
Disadvantages:
• Hash Collisions: In rare cases, two different substrings may have
the same hash value (a hash collision). This necessitates a
character-by-character comparison to check if the substrings are
identical.
• Preprocessing Overhead: Although it’s efficient in searching,
calculating hash values for every substring and pattern requires
some overhead.
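A minimal Python sketch of the rabin_karp_search function described in the explanation below (base 256 and the prime modulus 101 are illustrative choices, not prescribed by the notes) is:

def rabin_karp_search(text, pattern, base=256, prime=101):
    """Return the starting indices of all occurrences of pattern in text,
    using a polynomial rolling hash modulo a small prime."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    matches = []
    h = pow(base, m - 1, prime)         # weight of the leading character
    p_hash = t_hash = 0
    # Compute the hash of the pattern and of the first window of the text.
    for i in range(m):
        p_hash = (p_hash * base + ord(pattern[i])) % prime
        t_hash = (t_hash * base + ord(text[i])) % prime
    for s in range(n - m + 1):
        # Verify character by character only when the hashes agree,
        # to rule out hash collisions.
        if p_hash == t_hash and text[s:s + m] == pattern:
            matches.append(s)
        if s < n - m:
            # Rolling hash: drop text[s], append text[s + m].
            t_hash = ((t_hash - ord(text[s]) * h) * base + ord(text[s + m])) % prime
    return matches

print(rabin_karp_search("ABABDABACDABABCABAB", "ABAB"))  # [0, 10, 15]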
Explanation of the Code:
• The rabin_karp_search function implements the Rabin-Karp
algorithm to search for a pattern within a given text.
• It first calculates the hash values of the pattern and the initial
substring of the text that has the same length as the pattern.
• The hash values are then compared. If a match is found, a
character-by-character comparison is performed to ensure there
is no hash collision.
• The rolling hash technique is used to efficiently calculate the hash
values of the substrings of the text as the pattern slides through the
text.
5. Z Algorithm
Definition:
The Z Algorithm is an efficient string-matching algorithm used to solve pattern searching and string matching problems. It computes the Z-array of a string, which is an array where each element Z[i] represents the length of the longest substring starting at index i that matches a prefix of the string. This algorithm helps in quickly identifying the positions of a pattern in a text.
The Z Algorithm can be used for:
• Pattern matching (finding occurrences of a pattern in a text).
• String comparison.
• Finding repeated substrings.
• String parsing tasks.
Explanation:
The Z-array for a string S of length n is an array of the same length n, where each Z[i] denotes the length of the longest substring starting at position i that is also a prefix of the string S.
Steps to Compute Z-array:
1. Initialization:
o Set Z[0] = n because the entire string matches itself as a prefix.
o Define two variables l (left) and r (right) to represent the current range [l, r] of the substring being considered that matches a prefix of the string.
2. Iterate Over the String:
o For each index i from 1 to n − 1:
▪ If i is outside the current window [l, r], start a fresh comparison of the substring starting at i with the prefix.
▪ If i is inside the window, use previously computed values to avoid redundant comparisons (taking advantage of the previously matched portions).
o Update the l and r values accordingly.
3. Pattern Matching:
o Once the Z-array is computed, you can find occurrences of a pattern P in the text T by concatenating P with T (separated by a special character) and calculating the Z-array for this concatenated string. Any Z-value that equals the length of the pattern corresponds to a match at the respective position in the text.
Time Complexity:
• Time Complexity: O(n + m), where n is the length of the text and m is the length of the pattern (in the case of pattern matching). This is because the Z algorithm computes the Z-array in linear time.
• Space Complexity: O(n), where n is the length of the string (space
is used for storing the Z-array).
Advantages:
• Efficient Pattern Matching: The Z algorithm is fast for string
matching, with linear time complexity in practice.
• Simple and Elegant: The Z algorithm is conceptually simple and
uses previously computed results to avoid redundant calculations,
leading to an overall efficient performance.
• Multiple Occurrences: The Z algorithm can efficiently find all
occurrences of a pattern in a text.
Disadvantages:
• No Preprocessing: The Z algorithm does not perform
preprocessing in the same way as algorithms like KMP or Boyer-
Moore, which could result in fewer optimizations for specific
patterns.
• Limited Use: While the Z algorithm is great for certain problems like
pattern matching and finding repeated substrings, it might not be the
best choice for all types of string searching or matching tasks.
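A minimal Python sketch of the z_algorithm and pattern_matching functions described in the explanation below (the "$" separator is assumed not to occur in either string) is:

def z_algorithm(s):
    """Compute the Z-array: z[i] = length of the longest substring starting at i
    that matches a prefix of s."""
    n = len(s)
    z = [0] * n
    z[0] = n
    l, r = 0, 0                         # [l, r) is the rightmost window matching a prefix
    for i in range(1, n):
        if i < r:
            z[i] = min(r - i, z[i - l])  # reuse previously computed values
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1                    # extend the match by direct comparison
        if i + z[i] > r:
            l, r = i, i + z[i]           # update the window
    return z

def pattern_matching(text, pattern):
    """Print every index in text where pattern occurs, using the Z-array of pattern$text."""
    concat = pattern + "$" + text
    z = z_algorithm(concat)
    m = len(pattern)
    for i in range(m + 1, len(concat)):
        if z[i] == m:
            print("Pattern found at index", i - m - 1)

pattern_matching("ABABDABACDABABCABAB", "ABAB")  # indices 0, 10, 15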
Explanation of the Code:
1. The z_algorithm function computes the Z-array for a given string s.
2. The pattern_matching function concatenates the pattern and the
text with a special delimiter ($), computes the Z-array for the
concatenated string, and then looks for all occurrences of the
pattern by checking where the Z-value is equal to the length of the
pattern.
3. For each match, the function prints the position in the text where the
pattern is found.
