Unit 5
Pattern Matching and Tries: Pattern matching algorithms (brute force, the Boyer-Moore
algorithm, the Knuth-Morris-Pratt algorithm), Standard Tries, Compressed Tries,
Suffix Tries.
Pattern Searching
Pattern searching is the task of finding occurrences of a pattern, such as a string, a word
or part of an image, inside a larger body of data.
Pattern searching algorithms are useful for finding a pattern among the substrings of a
larger string. This can be accomplished using a variety of algorithms.
The input is a Pattern of length ‘P’ and a Text of length ‘T’, where ‘P’ is no larger
than ‘T’.
Pattern matching techniques fall into two categories:
• Single pattern matching technique
• Multiple pattern matching technique
In single pattern matching, all occurrences of one pattern must be found in the given
input text. If more than one pattern is matched against the given input text
simultaneously, it is known as multiple pattern matching.
The main objective of pattern-matching algorithms is to reduce the total number of
character comparisons between the pattern and the text, and so to increase the overall
efficiency. The efficiency of an algorithm is evaluated by its running time and by the type
of input it is given. Pattern matching algorithms are widely used in network security,
information retrieval, text editors, etc.
Pattern searching algorithms are sometimes also referred to as string searching
algorithms and are considered a part of the string algorithms. They are used to search for
one string within another string.
Brute Force Algorithm
The simplest approach to the string matching problem is the Brute Force Algorithm, which
is also known as the Naive Algorithm. It follows a linear search approach.
As shown in the Figure, the algorithm compares the first letter of the Text with the first
letter of the Pattern. If they are equal, it compares the second letters of the text and the
pattern, and so on. If they are not equal, it moves the first letter of the pattern to the
second letter of the text and compares again. When a complete match is found, it returns
the starting location.
Example: Let the Text (T) be THIS IS A SIMPLE EXAMPLE and the Pattern (P) be
SIMPLE.
Brute-force string matching compares a given pattern with every substring of the text that
has the same length as the pattern. Each comparison between substring and pattern proceeds
character by character until a mismatch is found. On a mismatch, the remaining character
comparisons for that substring are skipped and the next substring is selected immediately.
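The procedure described above can be sketched in C++ as follows. This is a minimal illustration, not a reference implementation; the function name bruteForceSearch is an assumption made for this example, and it returns the first starting location (or -1 when there is none).

```cpp
#include <string>

// Hypothetical helper: returns the index of the first occurrence of
// `pattern` in `text`, or -1 if the pattern does not occur.
int bruteForceSearch(const std::string& text, const std::string& pattern) {
    const int n = static_cast<int>(text.size());
    const int m = static_cast<int>(pattern.size());
    for (int s = 0; s + m <= n; ++s) {      // s: start of the current substring
        int j = 0;
        // Compare character by character until a mismatch is found.
        while (j < m && text[s + j] == pattern[j]) ++j;
        if (j == m) return s;               // all m characters matched
        // Otherwise drop this substring and move on to the next one.
    }
    return -1;
}
```

For the example above, bruteForceSearch("THIS IS A SIMPLE EXAMPLE", "SIMPLE") returns 10, counting positions from 0.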
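The brute force algorithm is one of the simplest ways of string searching. It is also one of
the least efficient in terms of time: in the worst case it performs about (n - m + 1) * m
character comparisons for a text of length n and a pattern of length m. It needs no extra
space beyond a few index variables. It is popular because of its simplicity.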
Example 2:
n is the size of the string S and m is the size of P
P: pep
This means that for every position of the pattern the maximum number of characters to be
compared is m, since in the worst case the entire pattern is checked each time. The starting
positions run only up to index n - m of the string, not n - 1, because m characters must
remain to compare against the pattern.
The moment there is a mismatch, the pattern is shifted by exactly one position; no larger
shift is considered. The process continues until the pattern reaches the last character of
the string S.
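To make the comparison count concrete, here is a small sketch that finds every occurrence and counts character comparisons. The struct BruteForceResult, the function bruteForceAll, and the sample strings in the note after the code are assumptions for illustration; the original text of Example 2 is not shown in these notes.

```cpp
#include <string>
#include <vector>

// Hypothetical result type for this sketch.
struct BruteForceResult {
    std::vector<int> positions;  // start index of every occurrence
    long comparisons = 0;        // character comparisons performed
};

BruteForceResult bruteForceAll(const std::string& text, const std::string& pattern) {
    BruteForceResult result;
    const int n = static_cast<int>(text.size());
    const int m = static_cast<int>(pattern.size());
    for (int s = 0; s <= n - m; ++s) {       // last start position is n - m
        int j = 0;
        while (j < m) {
            ++result.comparisons;
            if (text[s + j] != pattern[j]) break;  // mismatch: shift by one
            ++j;
        }
        if (j == m) result.positions.push_back(s);
    }
    return result;
}
```

With the assumed text "pepep" and P = "pep", the occurrences are at positions 0 and 2 and 7 comparisons are made. With text "aaaa" and pattern "aab", every position needs all m = 3 comparisons, giving (n - m + 1) * m = 6.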
Boyer-Moore Algorithm
The Boyer-Moore algorithm searches for a pattern of length m in a text of length n and
reports all the occurrences of the pattern in the text. In this algorithm, the characters of
the pattern are checked from right to left. We compare the characters of the pattern (let us
call it P) with those of the text (let us call it T). If we get a match, we check the next
characters of both P and T, moving left. If we get a mismatch, we check whether the
character in T which caused the mismatch (let us call it ‘c’) is present anywhere in P. If it
is present, we shift P until an occurrence of c in P is aligned with c in T; if it is not
present, we shift P completely past that character in T. This type of shift avoids comparing
the same characters again and again, improving the running time of the algorithm. Like the
other string matching algorithms, this algorithm also preprocesses the pattern. Boyer-Moore
uses a combination of two approaches: the bad character heuristic and the good suffix
heuristic. The discussion that follows covers the bad character heuristic.
You are given two strings. One is the pattern and the other one is the
text. You have to find the pattern within the text. Let’s look at an
example in which the Text is THIS IS A TEST and the Pattern is TEST.
You can clearly see that the Text contains the pattern, but the computer
cannot identify that at a glance. This is what string matching
algorithms are for. So now let’s see, step by step, how to calculate the
bad character table (also called the Bad Match Table).
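Notice the letters in the table below. The unique characters of the
pattern are taken as the letters, and the star (*) stands for any other
character, not in the pattern, that we may come upon when going
through the Text. Now we have to fill this table.
Letter | T | E | S | * |
Value  | 1 | 2 | 1 | 4 |
Each value is computed as Value = length of pattern - index of letter - 1, and the
star (*) gets the length of the pattern, here 4.
Value( T ) = 4 - 0 - 1 = 3
Value( E ) = 4 - 1 - 1 = 2
Value( S ) = 4 - 2 - 1 = 1
The T at the end of the pattern (index 3) gives 4 - 3 - 1 = 0. A shift must be at least 1,
so this value is raised to 1, and it replaces the 3 computed for the first T. This appears to
be why the table lists T = 1.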
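The table construction can be sketched as follows. The rule used here, max(1, length - index - 1) with later occurrences overwriting earlier ones, is inferred from the table and the three computed values above, so treat it as an assumption. The function names buildBadMatchTable and badMatchValue are hypothetical.

```cpp
#include <algorithm>
#include <map>
#include <string>

// Builds the bad match table for `pattern`. A letter that occurs more than
// once keeps the value of its last occurrence. Letters that are absent from
// the map correspond to the '*' column.
std::map<char, int> buildBadMatchTable(const std::string& pattern) {
    std::map<char, int> table;
    const int length = static_cast<int>(pattern.size());
    for (int index = 0; index < length; ++index)
        table[pattern[index]] = std::max(1, length - index - 1);
    return table;
}

// Looks up a letter; anything not in the pattern gets the pattern length.
int badMatchValue(const std::map<char, int>& table, char c, int patternLength) {
    const auto it = table.find(c);
    return it == table.end() ? patternLength : it->second;
}
```

For the pattern TEST this yields T = 1, E = 2, S = 1, and 4 for every other character, which agrees with the table.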
S and T are not matching. Find the S value from the Bad Match Table. It is 1. So shift 1
position.
Text:    THIS IS A TEST
Pattern:  TEST
Space and T are not matching. Find the Space value from the Bad Match Table. It is 4
(the * value). So shift 4 positions.
Text:    THIS IS A TEST
Pattern:      TEST
A and T are not matching. Find the A value from the Bad Match Table. It is 4. So shift 4
positions.
Text:    THIS IS A TEST
Pattern:          TEST
S and T are not matching. Find the S value from the Bad Match Table. It is 1. So shift 1
position.
Text:    THIS IS A TEST
Pattern:           TEST
All four characters now match, so the pattern is found at position 10 of the text (counting
from 0).
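The trace above can be reproduced with the sketch below. It is written under one assumption: after each alignment the pattern is shifted by the table value of the text character under the pattern's last position (the Horspool variant of the rule). In every step of the trace that character is also the mismatched one. The function name badMatchSearch and the optional shifts output are hypothetical.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Returns the start index of every occurrence of `pattern` in `text`.
// If `shifts` is non-null, every shift that is applied is appended to it.
std::vector<int> badMatchSearch(const std::string& text, const std::string& pattern,
                                std::vector<int>* shifts = nullptr) {
    std::vector<int> found;
    const int n = static_cast<int>(text.size());
    const int m = static_cast<int>(pattern.size());
    if (m == 0 || m > n) return found;

    // Bad match table over all byte values; the default ('*') is m.
    std::vector<int> value(256, m);
    for (int i = 0; i < m; ++i)
        value[static_cast<unsigned char>(pattern[i])] = std::max(1, m - i - 1);

    int s = 0;                                   // current start of the pattern
    while (s <= n - m) {
        int j = m - 1;
        while (j >= 0 && text[s + j] == pattern[j]) --j;   // right to left
        if (j < 0) found.push_back(s);
        // Shift by the value of the text character under the last position.
        const int shift = value[static_cast<unsigned char>(text[s + m - 1])];
        if (shifts) shifts->push_back(shift);
        s += shift;
    }
    return found;
}
```

For the text THIS IS A TEST and the pattern TEST, the first four shifts are 1, 4, 4 and 1, and the match is reported at position 10. Every table value is at least 1 and never larger than the safe shift, so the search cannot skip an occurrence.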
Alternatively, the bad character rule can be described through two cases. We call the
character of the text which does not match the pattern character a bad character.
Case 1: The mismatched character of text T is present in pattern P.
In this case, we shift the pattern P until that occurrence is aligned with the mismatched
character of T.
Suppose we get a mismatch between the ‘R’ of the text (the bad character) and the ‘C’ of the
pattern. Since ‘R’ is also present in the pattern, this is the first case, and we shift the
pattern until its ‘R’ is aligned with the ‘R’ (bad character) of the text.
After shifting, we compare again, because in some situations a match may start from that
position.
Case 2: The mismatched character of text T is not present in pattern P.
Here we got a mismatch, but “G” is not present in the pattern, so no alignment that
overlaps this “G” can match. We therefore shift pattern P completely past the “G” of the
text.
After the shift, we found our pattern in the above example, but there may be cases in which
there is no match at all. In that case we would return -1.
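The search with the bad character rule can be written as follows. Here badchar[c] stores
the index of the last occurrence of character c in the pattern, or -1 if c does not occur.
#include <algorithm>
#include <iostream>
#include <string>
const int NO_OF_CHARS = 256;
void search(const std::string& txt, const std::string& pat) {
    int n = txt.size(), m = pat.size();
    int badchar[NO_OF_CHARS];
    std::fill(badchar, badchar + NO_OF_CHARS, -1);   // -1: character not in pat
    for (int i = 0; i < m; i++)
        badchar[(unsigned char)pat[i]] = i;          // last occurrence wins
    int s = 0;                                       // shift of pat over txt
    while (s <= n - m) {
        int j = m - 1;
        while (j >= 0 && pat[j] == txt[s + j]) j--;  // compare right to left
        if (j < 0) {
            std::cout << "pattern occurs at shift = " << s << "\n";
            s += (s + m < n) ? m - badchar[(unsigned char)txt[s + m]] : 1;
        } else {
            /* Align the bad character in the text with its last occurrence in
               the pattern. max keeps the shift positive: j - badchar[...] is
               negative when that occurrence lies to the right of position j. */
            s += std::max(1, j - badchar[(unsigned char)txt[s + j]]);
        }
    }
}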
Knuth-Morris-Pratt (KMP) Algorithm
Let us execute the KMP Algorithm to find whether 'P' occurs in 'T'.
For 'P', the prefix function π was computed previously and is as follows:
Solution:
Initially: n = size of T = 15
m = size of P = 7
Pattern 'P' has been found to occur in string 'T'. The total number of
shifts that took place for the match to be found is i - m = 13 - 7 = 6 shifts.
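Since the strings and the prefix table of this example did not survive in these notes, the following is only a sketch of the standard KMP procedure. The strings mentioned in the note after the code are assumptions, chosen to agree with n = 15, m = 7 and a shift of 6; the function names are hypothetical.

```cpp
#include <string>
#include <vector>

// pi[q] = length of the longest proper prefix of pattern[0..q]
// that is also a suffix of pattern[0..q].
std::vector<int> prefixFunction(const std::string& pattern) {
    const int m = static_cast<int>(pattern.size());
    std::vector<int> pi(m, 0);
    int k = 0;                                   // length of the current border
    for (int q = 1; q < m; ++q) {
        while (k > 0 && pattern[k] != pattern[q]) k = pi[k - 1];
        if (pattern[k] == pattern[q]) ++k;
        pi[q] = k;
    }
    return pi;
}

// Returns the shift (0-based start index) of the first occurrence, or -1.
int kmpSearch(const std::string& text, const std::string& pattern) {
    const int n = static_cast<int>(text.size());
    const int m = static_cast<int>(pattern.size());
    if (m == 0) return 0;
    const std::vector<int> pi = prefixFunction(pattern);
    int q = 0;                                   // pattern characters matched so far
    for (int i = 0; i < n; ++i) {
        // On a mismatch, fall back using pi instead of re-reading the text.
        while (q > 0 && pattern[q] != text[i]) q = pi[q - 1];
        if (pattern[q] == text[i]) ++q;
        if (q == m) return i + 1 - m;            // i + 1 is the 1-based end position
    }
    return -1;
}
```

With the assumed T = "bacbabababacaca" and P = "ababaca", the prefix function of P is 0, 0, 1, 2, 3, 0, 1 and the match ends at the 13th character of T, so the shift is 13 - 7 = 6.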
Trie
The word "Trie" is taken from the word "retrieval". A trie is a sorted tree-based
data structure that stores a set of strings. Each node has as many pointers as there
are characters in the alphabet. A trie can search for a word in the
dictionary by following the word's prefixes. For example, if we assume that all
strings are formed from the letters 'a' to 'z' of the English alphabet, each trie node
can have a maximum of 26 pointers.
A trie is also known as a digital tree or prefix tree. The position of a node in the trie
determines the key with which that node is associated.
The diagram below depicts a trie representation for the words bell, bear, bore, bat, ball,
stop, stock, and stack.
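A standard trie of this kind can be sketched as below. It assumes keys made only of the lowercase letters 'a' to 'z', as in the description above; the names Trie, TrieNode, insert, search and startsWith are choices made for this illustration.

```cpp
#include <array>
#include <memory>
#include <string>

struct TrieNode {
    std::array<std::unique_ptr<TrieNode>, 26> child{};  // one pointer per letter
    bool isEndOfWord = false;                           // true if a key ends here
};

class Trie {
public:
    // Assumes `word` contains only 'a'..'z'.
    void insert(const std::string& word) {
        TrieNode* node = &root_;
        for (char c : word) {
            auto& next = node->child[c - 'a'];
            if (!next) next = std::make_unique<TrieNode>();
            node = next.get();
        }
        node->isEndOfWord = true;
    }

    // True only if `word` was inserted as a complete key.
    bool search(const std::string& word) const {
        const TrieNode* node = walk(word);
        return node != nullptr && node->isEndOfWord;
    }

    // True if at least one stored key begins with `prefix`.
    bool startsWith(const std::string& prefix) const {
        return walk(prefix) != nullptr;
    }

private:
    // Follows `s` from the root; returns nullptr if the path does not exist.
    const TrieNode* walk(const std::string& s) const {
        const TrieNode* node = &root_;
        for (char c : s) {
            if (c < 'a' || c > 'z') return nullptr;
            node = node->child[c - 'a'].get();
            if (!node) return nullptr;
        }
        return node;
    }

    TrieNode root_;
};
```

After inserting the eight words listed above, search("bell") is true, search("be") is false because "be" is only a prefix, and startsWith("be") is true.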
Deletion of a node
The delete operation removes the key recursively, in a bottom-up way.
The first case is to check whether the key exists in the trie data structure. If it does
not, the delete operation must leave the trie unmodified.
For example:
We search for the word “best” in the trie, but since it does not exist, no
deletion occurs.
The second case is that the key is unique, i.e., no part of the key is another key, and
the key itself is not a prefix of another key. Then we can delete all of its nodes that
are not shared with other keys.
The third case is when the key itself is a prefix of another key in the trie data
structure. (Prefix case)
For example, we want to delete “car”, but “car” is also a prefix of “carter”. No node can
be removed, so we only change the end-of-word boolean value of the last node of “car” from
“true” to “false”.
The fourth case is when the key we are deleting has another key as its prefix.
For example, we want to delete “balling”. We delete nodes from the bottom up until
we reach the end-of-word node for the word “ball”. Hence, we delete 3 characters: “g”, “n”
and then “i”.
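The four cases can be handled by one recursive routine, sketched below under the assumption that keys use only 'a' to 'z'. The free functions trieInsert, trieContains, trieErase and countNodes are hypothetical names for this example; countNodes is included only so that the effect on the nodes can be observed.

```cpp
#include <array>
#include <cstddef>
#include <memory>
#include <string>

struct TrieNode {
    std::array<std::unique_ptr<TrieNode>, 26> child{};
    bool isEndOfWord = false;
    bool hasChildren() const {
        for (const auto& c : child)
            if (c) return true;
        return false;
    }
};

void trieInsert(TrieNode& root, const std::string& key) {
    TrieNode* node = &root;
    for (char c : key) {
        auto& next = node->child[c - 'a'];
        if (!next) next = std::make_unique<TrieNode>();
        node = next.get();
    }
    node->isEndOfWord = true;
}

bool trieContains(const TrieNode& root, const std::string& key) {
    const TrieNode* node = &root;
    for (char c : key) {
        node = node->child[c - 'a'].get();
        if (!node) return false;
    }
    return node->isEndOfWord;
}

std::size_t countNodes(const TrieNode& node) {   // includes the root
    std::size_t total = 1;
    for (const auto& c : node.child)
        if (c) total += countNodes(*c);
    return total;
}

// Returns true when `node` is left without children and without an
// end-of-word flag, i.e. when its parent may free it.
bool eraseRec(TrieNode& node, const std::string& key, std::size_t depth, bool& erased) {
    if (depth == key.size()) {
        if (!node.isEndOfWord) return false;     // case 1: key is not stored
        node.isEndOfWord = false;                // case 3 ends here if children remain
        erased = true;
        return !node.hasChildren();
    }
    auto& next = node.child[key[depth] - 'a'];
    if (!next) return false;                     // case 1: path does not exist
    if (!eraseRec(*next, key, depth + 1, erased)) return false;
    next.reset();                                // bottom-up removal of the child
    // Stop at a node that ends another key (case 4) or still has children.
    return !node.isEndOfWord && !node.hasChildren();
}

bool trieErase(TrieNode& root, const std::string& key) {
    bool erased = false;
    eraseRec(root, key, 0, erased);
    return erased;
}
```

With the keys car, carter, ball and balling stored, erasing "best" changes nothing, erasing "car" only clears a flag, and erasing "balling" removes exactly the three nodes for "g", "n" and "i".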
Advantages of Trie
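1. It can insert and search strings faster than binary search trees, and faster than
hash tables in the worst case, because both operations take time proportional to the
length of the key.
2. It keeps the entries in alphabetical order of their keys, so they can be filtered by prefix.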
Disadvantages of Trie
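1. It requires more memory to store the strings, because every node holds a pointer
for each character of the alphabet.
2. In practice, looking up a complete key can be slower than in a hash table.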
Applications of tries
1. Tries can insert, delete and search for entries efficiently. Hence they
are used in building dictionaries, such as directories of telephone numbers or
English words.
2. Tries are also used in spell-checking software.
Compressed Trie
A compressed trie has the following properties:
1. A compressed trie is an advanced version of the standard trie.
2. Each node (except the leaf nodes) has at least 2 children.
3. It is used to achieve space optimization.
4. To derive a compressed trie from a standard trie, chains of
redundant nodes (nodes with a single child) are compressed into single edges.
5. It consists of grouping, re-grouping and un-grouping of the
characters of keys.
6. While performing the insertion operation, it may be required to
un-group already grouped characters.
7. While performing the deletion operation, it may be required to
re-group already grouped characters.
8. A compressed trie T storing s strings (keys) has s external nodes
and O(s) total number of nodes.
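The grouping and un-grouping described in the list can be sketched as follows. This is one possible layout, not a definitive one: each node stores the group of characters on the edge leading to it, and insertion splits (un-groups) a label when a new key diverges inside it. All names (CompressedNode, insertKey, containsKey, countExternalNodes) are assumptions for this example.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <string>

struct CompressedNode {
    std::string label;          // grouped characters on the edge into this node
    bool isEndOfWord = false;
    std::map<char, std::unique_ptr<CompressedNode>> child;  // keyed by first label char
};

void insertKey(CompressedNode* node, const std::string& key, std::size_t pos = 0) {
    if (pos == key.size()) { node->isEndOfWord = true; return; }
    auto it = node->child.find(key[pos]);
    if (it == node->child.end()) {
        // No edge starts with this character: add the rest of the key as one group.
        auto leaf = std::make_unique<CompressedNode>();
        leaf->label = key.substr(pos);
        leaf->isEndOfWord = true;
        node->child[key[pos]] = std::move(leaf);
        return;
    }
    CompressedNode* next = it->second.get();
    std::size_t k = 0;          // length of the common prefix of label and key[pos..]
    while (k < next->label.size() && pos + k < key.size() &&
           next->label[k] == key[pos + k]) ++k;
    if (k < next->label.size()) {
        // Un-group: split the label into its first k characters and the rest.
        auto mid = std::make_unique<CompressedNode>();
        mid->label = next->label.substr(0, k);
        std::unique_ptr<CompressedNode> rest = std::move(it->second);
        rest->label = rest->label.substr(k);
        const char first = rest->label[0];
        mid->child[first] = std::move(rest);
        it->second = std::move(mid);
        next = it->second.get();
    }
    insertKey(next, key, pos + k);
}

bool containsKey(const CompressedNode* node, const std::string& key, std::size_t pos = 0) {
    if (pos == key.size()) return node->isEndOfWord;
    const auto it = node->child.find(key[pos]);
    if (it == node->child.end()) return false;
    const std::string& label = it->second->label;
    if (key.compare(pos, label.size(), label) != 0) return false;
    return containsKey(it->second.get(), key, pos + label.size());
}

// Counts leaves (external nodes); meaningful once at least one key is stored.
std::size_t countExternalNodes(const CompressedNode& node) {
    if (node.child.empty()) return 1;
    std::size_t total = 0;
    for (const auto& entry : node.child) total += countExternalNodes(*entry.second);
    return total;
}
```

Inserting the words bell, bear, bore, bat, ball, stop, stock and stack gives 8 external nodes, one per key, which agrees with property 8 for s = 8. Deletion with re-grouping is left out of this sketch.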
Below is the illustration of the Compressed Trie:
How to build a Suffix Tree for a given text?
A suffix tree is a compressed trie of all the suffixes of a text, so the
following are very abstract steps to build a suffix tree from a given text.
1) Generate all suffixes of the given text.
2) Consider all suffixes as individual words and build a compressed trie.
Let us consider an example text “banana$”, where ‘$’ is the string
termination character. Following are all the suffixes of “banana$”:
$
a$
na$
ana$
nana$
anana$
banana$
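Step 1 can be sketched in a few lines. The function name allSuffixes is an assumption for this example; it lists the suffixes from shortest to longest.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Returns every non-empty suffix of `text`, shortest first.
std::vector<std::string> allSuffixes(const std::string& text) {
    std::vector<std::string> suffixes;
    for (std::size_t start = text.size(); start-- > 0; )
        suffixes.push_back(text.substr(start));   // suffix beginning at `start`
    return suffixes;
}
```

A text of length n has n suffixes, so “banana$” has 7, including the suffix that consists of ‘$’ alone.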
If we consider all of the above suffixes as individual words and build a trie,
we get the following.
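As a sketch of this uncompressed suffix trie, the code below inserts every suffix of the text character by character. The class name SuffixTrie and its members are assumptions for this example. The construction is the naive one, which takes time and space quadratic in the length of the text.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <string>

struct SuffixTrieNode {
    std::map<char, std::unique_ptr<SuffixTrieNode>> child;
};

class SuffixTrie {
public:
    // `text` is expected to end with a terminator such as '$'.
    explicit SuffixTrie(const std::string& text) {
        for (std::size_t start = 0; start < text.size(); ++start) {
            SuffixTrieNode* node = &root_;
            for (std::size_t i = start; i < text.size(); ++i) {
                auto& next = node->child[text[i]];
                if (!next) next = std::make_unique<SuffixTrieNode>();
                node = next.get();
            }
        }
    }

    // Every substring of the text is a prefix of some suffix, so it
    // corresponds to a path that starts at the root.
    bool containsSubstring(const std::string& s) const {
        const SuffixTrieNode* node = &root_;
        for (char c : s) {
            const auto it = node->child.find(c);
            if (it == node->child.end()) return false;
            node = it->second.get();
        }
        return true;
    }

    std::size_t nodeCount() const { return count(root_); }   // includes the root

private:
    static std::size_t count(const SuffixTrieNode& node) {
        std::size_t total = 1;
        for (const auto& entry : node.child) total += count(*entry.second);
        return total;
    }

    SuffixTrieNode root_;
};
```

For “banana$”, containsSubstring("nan") is true and containsSubstring("nab") is false. Joining the chains of single-child nodes in this structure is what turns it into the compressed form.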
If we join chains of single nodes, we get the following compressed trie,
which is the Suffix Tree for the given text “banana$”.