Unit 5

Uploaded by

Avyuktha Raju
Copyright
© © All Rights Reserved
Unit-5

Pattern Matching and Tries: Pattern matching algorithms: Brute force, the Boyer-Moore
algorithm, the Knuth-Morris-Pratt algorithm, Standard Tries, Compressed Tries,
Suffix tries.

Pattern Searching
Pattern searching is the problem of finding occurrences of a pattern, such as a string, a
word, or an image, inside a larger body of data.
Pattern searching algorithms are used to find a pattern as a substring of a larger
string. Several different algorithms solve this problem.
The input is a Pattern of length ‘P’ and a Text of length ‘T’, where ‘P’ is no larger
than ‘T’.
Pattern matching techniques fall into two categories:
• Single pattern matching technique
• Multiple pattern matching technique
In single pattern matching, all occurrences of one pattern must be found in the given
input text. If more than one pattern is matched against the given input text
simultaneously, it is known as multiple pattern matching.
The main objective of pattern-matching algorithms is to reduce the total number of
character comparisons between the pattern and the text and so increase overall efficiency.
The efficiency of an algorithm is evaluated by its running time on the types of input it is
given. Pattern matching algorithms are widely used in network security, information
retrieval, text editors, etc.
Pattern searching algorithms are sometimes also referred to as String Searching
Algorithms and are considered a part of the String algorithms. These algorithms are useful
for searching for one string within another string.
Brute Force Algorithm

The simplest approach to the string matching problem is the Brute Force Algorithm, which
is also known as the Naive Algorithm. It follows a linear search approach.

The algorithm first compares the first letter of the Text with the first letter of the
Pattern. If they are equal, it compares the second letters of the text and pattern, and so
on. If they are not equal, it moves the pattern one position to the right, so that the first
letter of the pattern is aligned with the second letter of the text, and compares again.
When a match is found, it returns the starting location.

Example: Let the Text (T) be THIS IS A SIMPLE EXAMPLE and the Pattern (P) be
SIMPLE.
Brute-force string matching compares a given pattern with all substrings of a given text.
Each comparison between a substring and the pattern proceeds character by character until a
mismatch is found. Whenever a mismatch is found, the remaining character comparisons for
that substring are dropped and the next substring is selected immediately.

A brute force algorithm is one of the simplest ways of string searching. It is also one of the
most inefficient in terms of time: in the worst case it makes on the order of n × m character
comparisons for a text of length n and a pattern of length m. It needs no extra memory
beyond a few counters, and it is popular because of its simplicity.

Example 2:

the pattern is denoted as P

the string is denoted as S

m is the size of P

n is the size of the S

m is less than or equal to n

S: Peter Piper picked a peck of pickled peppers

P: pep

The characters of P are matched against the characters of S, one alignment at a time.

This means that for every alignment the maximum number of characters to be compared is m.

The starting position of the pattern ranges from 0 to n - m, and not up to n - 1, because
there must be m characters of S left to compare at the end.

The moment there is a mismatch, the pattern is shifted by exactly one position; no larger
shift is considered. The process continues until the pattern reaches the end of the
string S.
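
The procedure described above can be sketched in C as follows. This is a minimal illustration rather than a reference implementation; the function name brute_force_search and its return convention (index of the first match, or -1 if there is none) are assumptions made for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns the 0-based index of the first occurrence of pattern in text,
   or -1 if the pattern does not occur. */
long brute_force_search(const char *text, const char *pattern)
{
    assert(text != NULL && pattern != NULL);
    size_t n = strlen(text);     /* n is the size of S */
    size_t m = strlen(pattern);  /* m is the size of P */

    /* Try every starting position s from 0 to n - m. */
    for (size_t s = 0; s + m <= n; s++) {
        size_t j = 0;
        /* Compare character by character until a mismatch is found. */
        while (j < m && text[s + j] == pattern[j])
            j++;
        if (j == m)              /* all m characters matched */
            return (long)s;
        /* Otherwise shift the pattern by one position and try again. */
    }
    return -1;
}
```

With the strings from Example 2, brute_force_search("Peter Piper picked a peck of pickled peppers", "pep") returns 37, the index where "peppers" begins.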
Boyer-Moore Algorithm

The Boyer Moore algorithm searches for a pattern of length m in a text of length n and
reports all occurrences of the pattern in the text. In this algorithm, we compare
characters starting from the right end of the pattern and move to the left. We compare
characters of the pattern (let us call it P) and the text (let us call it T). If we get a
match, we check the next characters (to the left) of both P and T. If we get a mismatch, we
check whether the character in T that caused the mismatch (let us call it ‘c’) is present
anywhere in P. If it is present, we shift P until an occurrence of c in P is aligned with c
in T; if it is not present, we shift P completely past that character in T. This type of
shift avoids comparing needless characters again and again, improving the running time
of the algorithm. Like the Knuth-Morris-Pratt algorithm, this algorithm preprocesses
the pattern. Boyer Moore uses a combination of two approaches:

• Bad character heuristic.

• Good suffix heuristic.

BAD CHARACTER HEURISTICS

You are given two strings. One is the pattern and the other one is the
text. You have to find the pattern within the text. Let’s look at an
example,

Text => “THIS IS A TEST”

Pattern => “TEST”

You can clearly see that the Text has the pattern but the computer
cannot identify that just like that. This is what string matching
algorithms are for. So now let’s see step by step how to calculate the
bad character table.

Steps are as follows,

1. Take the pattern and index each character starting from 0.

Character | T | E | S | T |

Index | 0 | 1 | 2 | 3 |

2. Take the length of the pattern as length. Here, length = 4.

3. Now let’s draw the skeleton of the table.

Letter | T | E | S | * |

Value | | | | |

Notice the Letters row in the table. We take the unique
characters in the pattern as letters. The star (*) stands for any other
character we may come upon when going through the Text that is
not in the pattern. Now we have to fill this table.

4. Now, for every character in the pattern, calculate
max(1, length-index-1) and put it in the table.

Here, max(1, length-index-1) means that you choose the larger of 1 and
the calculated length-index-1 value for every letter, so the value can
never be less than 1.

length = length of the pattern

index = index of the character

NOTE: If the same character repeats, overwrite the table entry with the
value from the later occurrence (or simply consider the rightmost
occurrence of the character in the pattern and calculate the value
for it).

The value for the * is the length of the pattern.

Value( T ) = 4-0-1 = 3

Value( E ) = 4-1-1 = 2

Value( S ) = 4-2-1 = 1

Value( T ) = max(1, 4-3-1) = max(1, 0) = 1, which replaces the earlier
T value because this is the rightmost occurrence of T

Letter | T | E | S | * |

Value | 1 | 2 | 1 | 4 |

Text => “THIS IS A TEST”

Pattern => “TEST”


T H I S   I S   A   T E S T
T E S T

S and T are not matching. Find the S value from the Bad Match Table. It is 1. So shift 1
position.

T H I S   I S   A   T E S T
  T E S T

Space and T are not matching. Space is not in the pattern, so use the * value from the Bad
Match Table. It is 4. So shift 4 positions.

T H I S   I S   A   T E S T
          T E S T

A and T are not matching. A is not in the pattern, so use the * value from the Bad Match
Table. It is 4. So shift 4 positions.

T H I S   I S   A   T E S T
                  T E S T

S and T are not matching. Find the S value from the Bad Match Table. It is 1. So shift 1
position.

T H I S   I S   A   T E S T
                    T E S T

All are matching. The pattern is found at the 11th position (index 10 when counting from 0).
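
The table construction and the walkthrough above can be sketched in C as follows. This is a simplified illustration of the bad match table exactly as these notes describe it (the shift is looked up for the text character aligned with the last pattern character); the names build_bad_match_table, bad_match_search and ALPHABET are assumptions made for this sketch, and the good suffix heuristic is not included.

```c
#include <assert.h>
#include <string.h>

#define ALPHABET 256  /* one table entry per possible unsigned char value */

/* Fills table[c] with the shift for character c:
   max(1, length - index - 1) for the rightmost occurrence of c in the
   pattern, and length for every character not in the pattern (the '*'). */
void build_bad_match_table(const char *pattern, int table[ALPHABET])
{
    int length = (int)strlen(pattern);
    for (int c = 0; c < ALPHABET; c++)
        table[c] = length;
    for (int index = 0; index < length; index++) {
        int value = length - index - 1;
        /* Later occurrences overwrite earlier ones. */
        table[(unsigned char)pattern[index]] = value > 1 ? value : 1;
    }
}

/* Returns the 0-based index of the first match, or -1 if there is none. */
int bad_match_search(const char *text, const char *pattern)
{
    assert(text != NULL && pattern != NULL);
    int n = (int)strlen(text);
    int m = (int)strlen(pattern);
    int table[ALPHABET];

    if (m == 0)
        return 0;
    build_bad_match_table(pattern, table);

    int s = 0;                       /* current shift of the pattern */
    while (s <= n - m) {
        int j = m - 1;               /* compare from right to left */
        while (j >= 0 && pattern[j] == text[s + j])
            j--;
        if (j < 0)
            return s;                /* every character matched */
        /* Shift by the table value of the text character that is
           aligned with the last character of the pattern. */
        s += table[(unsigned char)text[s + m - 1]];
    }
    return -1;
}
```

For the example above, bad_match_search("THIS IS A TEST", "TEST") makes the same shifts of 1, 4, 4 and 1 and returns 10, which is the 11th position.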

OR

Here we consider two cases, and we call the character of the text that does not match the
pattern character a bad character.
Case 1: The mismatched character of text T is present in pattern P.
In this case, we shift the pattern P until an occurrence of that character in P is aligned
with the mismatched character of T.
Suppose we get a mismatch between the ‘R’ of the text (the bad character) and the ‘C’ of the
pattern, and ‘R’ is also present in the pattern. This is the first case, so we shift the
pattern until its ‘R’ is aligned with the ‘R’ (bad character) of the text.
We shift only this far because a match may start at the new position, and a larger shift
could skip it.

Case 2: The mismatched character of text T is not present in pattern P.

In this case, we shift the pattern P until it gets past that mismatched character of T.

Suppose we get a mismatch on the “G” of the text, and “G” is not present in the pattern. No
alignment of the pattern that overlaps this “G” can match, so we shift pattern P to just
after the “G” of the text.
If the pattern matches after a shift, we report its position. If the whole text is scanned
without finding any match, we return -1.

/* C Program for Bad Character Heuristic of Boyer
   Moore String Matching Algorithm */
#include <string.h>
#include <stdio.h>

#define NO_OF_CHARS 256

// A utility function to get maximum of two integers
int max(int a, int b) { return (a > b) ? a : b; }

// The preprocessing function for Boyer Moore's
// bad character heuristic
void badCharHeuristic(const char *str, int size,
                      int badchar[NO_OF_CHARS])
{
    int i;

    // Initialize all occurrences as -1
    for (i = 0; i < NO_OF_CHARS; i++)
        badchar[i] = -1;

    // Fill the actual value of last occurrence
    // of a character. The cast to unsigned char keeps
    // the index non-negative when char is signed.
    for (i = 0; i < size; i++)
        badchar[(unsigned char) str[i]] = i;
}

/* A pattern searching function that uses Bad
   Character Heuristic of Boyer Moore Algorithm */
void search(const char *txt, const char *pat)
{
    int m = strlen(pat);
    int n = strlen(txt);

    int badchar[NO_OF_CHARS];

    /* Fill the bad character array by calling
       the preprocessing function badCharHeuristic()
       for given pattern */
    badCharHeuristic(pat, m, badchar);

    int s = 0; // s is shift of the pattern with
               // respect to text
    while (s <= (n - m))
    {
        int j = m - 1;

        /* Keep reducing index j of pattern while
           characters of pattern and text are
           matching at this shift s */
        while (j >= 0 && pat[j] == txt[s + j])
            j--;

        /* If the pattern is present at current
           shift, then index j will become -1 after
           the above loop */
        if (j < 0)
        {
            printf("\n pattern occurs at shift = %d", s);

            /* Shift the pattern so that the next
               character in text aligns with the last
               occurrence of it in pattern.
               The condition s+m < n is necessary for
               the case when pattern occurs at the end
               of text */
            s += (s + m < n) ? m - badchar[(unsigned char) txt[s + m]] : 1;
        }
        else
        {
            /* Shift the pattern so that the bad character
               in text aligns with the last occurrence of
               it in pattern. The max function is used to
               make sure that we get a positive shift.
               We may get a negative shift if the last
               occurrence of bad character in pattern
               is on the right side of the current
               character. */
            s += max(1, j - badchar[(unsigned char) txt[s + j]]);
        }
    }
}

/* Driver program to test above function */
int main()
{
    char txt[] = "ABAAABCD";
    char pat[] = "ABC";
    search(txt, pat);
    return 0;
}
GOOD SUFFIX HEURISTIC
Knuth Morris Pratt (KMP) Algorithm:
The KMP Algorithm is one of the most popular pattern matching algorithms. KMP stands for
Knuth Morris Pratt. The KMP algorithm was invented by Donald Knuth and Vaughan
Pratt together, and independently by James H Morris, in the year 1970. In the year 1977, all
three jointly published the KMP Algorithm.
The KMP algorithm is used to find a "Pattern" in a "Text". This algorithm compares character
by character from left to right. But whenever a mismatch occurs, it uses a preprocessed table
called the "Prefix Table" to skip character comparisons while matching. Sometimes the prefix
table is also known as the LPS Table. Here LPS stands for "Longest proper Prefix which is
also Suffix".

Steps for Creating LPS Table (Prefix Table)

• Step 1 - Define a one dimensional array with the size equal to the length of the
Pattern. (LPS[size])
• Step 2 - Define variables i & j. Set i = 0, j = 1 and LPS[0] = 0.
• Step 3 - Compare the characters at Pattern[i] and Pattern[j].
• Step 4 - If both are matched then set LPS[j] = i+1 and increment both i & j values by
one. Go to Step 3.
• Step 5 - If both are not matched then check the value of variable 'i'. If it is '0' then
set LPS[j] = 0 and increment 'j' value by one, if it is not '0' then set i = LPS[i-1]. Go to
Step 3.
• Step 6 - Repeat above steps until all the values of LPS[] are filled.

Let us use above steps to create prefix table for a pattern...
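
Steps 1 to 6 can be sketched in C as follows. The variables i and j follow the steps above; the function name compute_lps and its signature are assumptions made for this illustration, and the caller is assumed to supply an array with one entry per pattern character.

```c
#include <assert.h>
#include <string.h>

/* Fills lps[0 .. m-1] for a pattern of length m, where lps[j] is the
   length of the longest proper prefix of pattern[0..j] that is also a
   suffix of pattern[0..j]. */
void compute_lps(const char *pattern, int *lps)
{
    assert(pattern != NULL && lps != NULL);
    int m = (int)strlen(pattern);
    if (m == 0)
        return;

    int i = 0, j = 1;                        /* Step 2 */
    lps[0] = 0;
    while (j < m) {                          /* Step 6: until all filled */
        if (pattern[i] == pattern[j]) {      /* Steps 3 and 4 */
            lps[j] = i + 1;
            i++;
            j++;
        } else if (i == 0) {                 /* Step 5, i is 0 */
            lps[j] = 0;
            j++;
        } else {                             /* Step 5, i is not 0 */
            i = lps[i - 1];
        }
    }
}
```

For the pattern "ABABACA" this fills the table with 0, 0, 1, 2, 3, 0, 1.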


How to use LPS Table
We use the LPS table to decide how many characters can be skipped when a mismatch has
occurred.
When a mismatch occurs, check the LPS value of the pattern character just before the
mismatched character in the pattern. If it is '0', compare the first character of the pattern
with the mismatched character in the text. If it is not '0', compare the pattern character
whose index equals that LPS value with the mismatched character in the text. In both
cases the text position does not move backwards. Only when the mismatch is at the first
character of the pattern itself do we move on to the next character in the text.
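
A compact C sketch of the search phase follows. It repeats a small LPS builder so that the block stands alone; the names kmp_search and compute_lps, and the choice to return the index of the first match (or -1), are assumptions made for this illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Builds the LPS (prefix) table for a pattern of length m (m >= 1). */
static void compute_lps(const char *pattern, int m, int *lps)
{
    int i = 0, j = 1;
    lps[0] = 0;
    while (j < m) {
        if (pattern[i] == pattern[j])
            lps[j++] = ++i;
        else if (i == 0)
            lps[j++] = 0;
        else
            i = lps[i - 1];
    }
}

/* Returns the 0-based index of the first occurrence of pattern in text,
   or -1 if there is none (or if memory for the table cannot be allocated). */
int kmp_search(const char *text, const char *pattern)
{
    assert(text != NULL && pattern != NULL);
    int n = (int)strlen(text);
    int m = (int)strlen(pattern);
    if (m == 0)
        return 0;

    int *lps = malloc((size_t)m * sizeof *lps);
    if (lps == NULL)
        return -1;
    compute_lps(pattern, m, lps);

    int found = -1;
    int i = 0;  /* position in the text, never moves backwards */
    int j = 0;  /* position in the pattern */
    while (i < n) {
        if (text[i] == pattern[j]) {
            i++;
            j++;
            if (j == m) {            /* whole pattern matched */
                found = i - m;
                break;
            }
        } else if (j == 0) {
            i++;                     /* mismatch at first pattern character */
        } else {
            j = lps[j - 1];          /* reuse the already matched prefix */
        }
    }
    free(lps);
    return found;
}
```

For example, kmp_search("AAB", "AB") returns 1: after the mismatch at the second text character, the first pattern character is compared with that same text character again, not with the one after it.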

KMP Algorithm Works


Let us see a working example of KMP Algorithm to find a Pattern in a Text...
Example 2:

Example: Given a string 'T' and pattern 'P' as follows:

Let us execute the KMP Algorithm to find whether 'P' occurs in 'T.'

For 'P' the prefix function, π, was computed previously and is as follows:

Solution:

Initially: n = size of T = 15
m = size of P = 7
Pattern 'P' has been found to occur completely in string 'T'. The total number of
shifts that took place for the match to be found is i-m = 13 - 7 = 6 shifts.

Trie
The word "Trie" is derived from the word "retrieval". A Trie is a sorted tree-based
data structure that stores a set of strings. Each node holds as many pointers as
there are characters in the alphabet. It can search for a word in the
dictionary with the help of the word's prefix. For example, if we assume that all
strings are formed from the letters 'a' to 'z' in the English alphabet, each trie node
can have a maximum of 26 pointers.

A Trie is also known as a digital tree or prefix tree. The position of a node in the Trie
determines the key with which that node is associated.

Properties of a Trie for a set of strings:


1. The root node of the trie is always an empty (null) node that stores no letter.
2. The children of each node are sorted alphabetically.
3. Each node can have a maximum of 26 children (A to Z).
4. Each node (except the root) stores one letter of the alphabet.

The diagram below depicts a trie representation for the words bell, bear, bore, bat, ball,
stop, stock, and stack.

Basic operations of Trie


1. Insertion of a node
2. Search Operation
3. Deletion of a node

Insertion of a node in the Trie


The first operation is to insert a new key into the trie. Before we start the
implementation, it is important to understand some points:
1. Every letter of the input key (word) is inserted as an individual Trie node. Note
that children point to the next level of Trie nodes.
2. Each character of the key acts as an index into the array of children.
3. If the present node already has a reference to the present letter, set the present node
to that referenced node. Otherwise, create a new node for the present letter, link it
from the present node, and then set the present node to this new node.
4. The length of the key determines the depth of the trie.
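
The points above can be sketched in C as follows. This is a minimal illustration that assumes keys contain only the lowercase letters 'a' to 'z' in a character set where those letters are contiguous (such as ASCII); the names TrieNode, trie_new_node, trie_insert and is_end_of_word are chosen for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define ALPHABET_SIZE 26  /* assumption: keys use only 'a'..'z' */

typedef struct TrieNode {
    struct TrieNode *children[ALPHABET_SIZE]; /* indexed by letter - 'a' */
    bool is_end_of_word;                      /* true if a key ends here */
} TrieNode;

/* Allocates an empty node; returns NULL if allocation fails. */
TrieNode *trie_new_node(void)
{
    TrieNode *node = malloc(sizeof *node);
    if (node != NULL) {
        for (int i = 0; i < ALPHABET_SIZE; i++)
            node->children[i] = NULL;
        node->is_end_of_word = false;
    }
    return node;
}

/* Inserts word into the trie rooted at root. Returns false if the word
   contains a character outside 'a'..'z' or if allocation fails. */
bool trie_insert(TrieNode *root, const char *word)
{
    assert(root != NULL && word != NULL);
    for (const char *p = word; *p != '\0'; p++)
        if (*p < 'a' || *p > 'z')
            return false;            /* reject before touching the trie */

    TrieNode *current = root;
    for (const char *p = word; *p != '\0'; p++) {
        int index = *p - 'a';        /* the letter selects the child slot */
        if (current->children[index] == NULL) {
            current->children[index] = trie_new_node();
            if (current->children[index] == NULL)
                return false;
        }
        current = current->children[index];  /* move down one level */
    }
    current->is_end_of_word = true;  /* mark the end of the key */
    return true;
}
```

Inserting "and" and then "ant" creates the nodes a, n and d for the first word and only one new node, t, for the second, because the prefix "an" is shared.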

Let us try to Insert “and” & “ant” in this Trie:

2. Searching in Trie Data Structure:


The search operation in a Trie is performed in a similar way to the insertion
operation. The only difference is that when the array of pointers in the
current node has no entry for the current character of the word, we
return false instead of creating a new node for that character.
This operation is used to check whether a string is present in the Trie data
structure. There are two search approaches in the Trie data structure.
1. Find whether the given word exists in the Trie.
2. Find whether any word that starts with the given prefix exists in the
Trie.
Both approaches follow a similar search pattern. The first step in
searching for a given word in a Trie is to split the word into characters and
then compare every character with the trie nodes, starting from the root node.
If the current character is present in the node, move forward to its child.
Repeat this process until all characters are found.
2.1 Searching Prefix in Trie Data Structure:
Search for the prefix “an” in the Trie Data Structure.

2.2 Searching Complete word in Trie Data Structure:


It is similar to prefix search, but additionally we have to check whether a
word ends at the node reached by the last character of the word.
Search “dad” in the Trie data structure
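
Both kinds of search can be sketched in C as follows. The block repeats a compact node type and insert routine so that it stands alone; it assumes lowercase keys 'a' to 'z', and the names trie_walk, trie_search and trie_starts_with are chosen for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define ALPHABET_SIZE 26  /* assumption: keys use only 'a'..'z' */

typedef struct TrieNode {
    struct TrieNode *children[ALPHABET_SIZE];
    bool is_end_of_word;
} TrieNode;

TrieNode *trie_new_node(void)
{
    TrieNode *node = malloc(sizeof *node);
    if (node != NULL) {
        for (int i = 0; i < ALPHABET_SIZE; i++)
            node->children[i] = NULL;
        node->is_end_of_word = false;
    }
    return node;
}

/* Compact insert; the caller is assumed to pass lowercase letters only. */
bool trie_insert(TrieNode *root, const char *word)
{
    TrieNode *current = root;
    for (; *word != '\0'; word++) {
        int index = *word - 'a';
        if (current->children[index] == NULL &&
            (current->children[index] = trie_new_node()) == NULL)
            return false;
        current = current->children[index];
    }
    current->is_end_of_word = true;
    return true;
}

/* Follows the characters of s from the root. Returns the node reached,
   or NULL as soon as a character has no matching child. */
static const TrieNode *trie_walk(const TrieNode *root, const char *s)
{
    const TrieNode *current = root;
    for (; *s != '\0' && current != NULL; s++) {
        if (*s < 'a' || *s > 'z')
            return NULL;
        current = current->children[*s - 'a'];
    }
    return current;
}

/* True if some stored word starts with prefix. */
bool trie_starts_with(const TrieNode *root, const char *prefix)
{
    assert(root != NULL && prefix != NULL);
    return trie_walk(root, prefix) != NULL;
}

/* True only if word itself was inserted: the walk must succeed and
   the node reached must be marked as the end of a word. */
bool trie_search(const TrieNode *root, const char *word)
{
    assert(root != NULL && word != NULL);
    const TrieNode *node = trie_walk(root, word);
    return node != NULL && node->is_end_of_word;
}
```

After inserting "and", "ant" and "dad", the prefix search for "an" succeeds, the word search for "dad" succeeds, and the word search for "da" fails because no word ends at that node.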

Deletion of a node

While performing the delete operation, we delete the key using recursion in a
bottom-up way.
The first case is to check whether the key exists in the trie data structure. If it does
not exist, the delete operation must leave the trie unmodified.

For example:

WORDS: [“apple”, “ball”, “car”, “dog”, “dogecoin”, “carter”, “balling”]

We want to delete the string “best”.

We search for the word “best” in the trie; since it does not exist, no
deletion occurs.

The second case is when the key is unique, i.e., no other key is a prefix of this
key and this key is not a prefix of any other key. Then we can delete all
of its nodes that are not shared with other keys.

For example, we can delete “apple” as it is a unique key.

The third case is when the key itself is a prefix of another key in the trie data
structure. (Prefix case)
For example, we want to delete “car” but “car” is also a prefix of “carter”. Hence, we only
change the end-of-word boolean value from “true” to “false” at the last node of “car” and
delete no nodes.

The fourth case is when the key we are deleting has another key as its prefix.

For example, we want to delete “balling”. We delete nodes from the bottom up until
we reach the end-of-word node for the word “ball”. Hence, we delete 3 characters: “g”, “n”
and then “i”.
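
The four cases can be handled by one bottom-up recursive routine, sketched in C below. The block repeats a compact node type, insert and search so that it stands alone; it assumes lowercase keys 'a' to 'z', and the names trie_delete, delete_helper and has_children are chosen for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define ALPHABET_SIZE 26  /* assumption: keys use only 'a'..'z' */

typedef struct TrieNode {
    struct TrieNode *children[ALPHABET_SIZE];
    bool is_end_of_word;
} TrieNode;

TrieNode *trie_new_node(void)
{
    TrieNode *node = malloc(sizeof *node);
    if (node != NULL) {
        for (int i = 0; i < ALPHABET_SIZE; i++)
            node->children[i] = NULL;
        node->is_end_of_word = false;
    }
    return node;
}

bool trie_insert(TrieNode *root, const char *word)
{
    TrieNode *current = root;
    for (; *word != '\0'; word++) {
        int index = *word - 'a';
        if (current->children[index] == NULL &&
            (current->children[index] = trie_new_node()) == NULL)
            return false;
        current = current->children[index];
    }
    current->is_end_of_word = true;
    return true;
}

bool trie_search(const TrieNode *root, const char *word)
{
    for (; *word != '\0' && root != NULL; word++)
        root = (*word >= 'a' && *word <= 'z') ? root->children[*word - 'a'] : NULL;
    return root != NULL && root->is_end_of_word;
}

static bool has_children(const TrieNode *node)
{
    for (int i = 0; i < ALPHABET_SIZE; i++)
        if (node->children[i] != NULL)
            return true;
    return false;
}

/* Returns true if the caller may free `node` because it is no longer
   needed by any key. Sets *deleted when the key was found and unmarked. */
static bool delete_helper(TrieNode *node, const char *word, bool *deleted)
{
    if (*word == '\0') {
        if (!node->is_end_of_word)
            return false;            /* case 1: key not present */
        node->is_end_of_word = false;
        *deleted = true;
        return !has_children(node);  /* case 3: keep node if it has children */
    }
    if (*word < 'a' || *word > 'z')
        return false;
    int index = *word - 'a';
    TrieNode *child = node->children[index];
    if (child == NULL)
        return false;                /* case 1: key not present */
    if (delete_helper(child, word + 1, deleted)) {
        free(child);
        node->children[index] = NULL;
        /* case 4: stop freeing at a node that ends another key or
           still leads to other keys. */
        return !node->is_end_of_word && !has_children(node);
    }
    return false;
}

/* Deletes word; returns true if it was present. The root is never freed. */
bool trie_delete(TrieNode *root, const char *word)
{
    assert(root != NULL && word != NULL);
    bool deleted = false;
    delete_helper(root, word, &deleted);
    return deleted;
}
```

With the word list from the example, deleting "best" changes nothing, deleting "apple" removes all five of its nodes, deleting "car" only clears the end-of-word flag, and deleting "balling" frees the nodes for g, n and i while "ball" remains.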

Advantages of Trie
1. Insertion and search take time proportional to the length of the key, which can
be faster than hash tables (in their worst case) and binary search trees.
2. It provides an alphabetical ordering of entries by the key of the node.

Disadvantages of Trie
1. It requires more memory to store the strings.
2. It can be slower than a hash table for looking up whole keys.

Applications of tries

1. Tries support inserting, deleting and searching for entries. Hence they
are used in building dictionaries, such as entries for telephone numbers or
English words.
2. Tries are also used in spell-checking software.

Tries are classified into three categories:


1. Standard Trie
2. Compressed Trie
3. Suffix Trie
Standard Trie
A standard trie has the following properties:
o It is an ordered tree-like data structure.
o Each node (except the root node) in a standard trie is labeled
with a character.
o The children of a node are in alphabetical order.
o Each node or branch represents a possible character of keys or
words.
o Each node may have multiple branches.
o The last node of every key or word is used to mark the end of
the word.
Ex: S = {bear, bell, bid, bull, buy, sell, stock, stop}

Compressed Trie
A compressed trie has the following properties:
1. A Compressed Trie is an advanced version of the standard trie.
2. Each internal node (every node except the leaf nodes) has at least 2 children.
3. It is used to achieve space optimization.
4. To derive a Compressed Trie from a Standard Trie, compression of
chains of redundant nodes is performed.
5. It consists of grouping, re-grouping and un-grouping of keys of
characters.
6. While performing the insertion operation, it may be required to
un-group the already grouped characters.
7. While performing the deletion operation, it may be required to
re-group the already grouped characters.
8. A compressed trie T storing s strings (keys) has s external nodes
and O(s) total nodes.
Below is the illustration of the Compressed Trie:
How to build a Suffix Tree for a given text?
A Suffix Tree is a compressed trie of all suffixes of a text, so the
following are very abstract steps to build a suffix tree from a given text.
1) Generate all suffixes of the given text.
2) Consider all suffixes as individual words and build a compressed trie.
Let us consider an example text “banana$” where ‘$’ is the string
termination character. Following are all suffixes of “banana$”:
$
a$
na$
ana$
nana$
anana$
banana$
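
Step 1, generating all suffixes, needs no copying in C, because every suffix of a string is simply a pointer into that string. The sketch below is an illustration only; the function name list_suffixes and the shortest-first order (chosen to match the list above) are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stores a pointer to every non-empty suffix of text in out[], shortest
   first, and returns how many were stored. out must have room for
   strlen(text) entries. The pointers refer into text itself. */
size_t list_suffixes(const char *text, const char **out)
{
    assert(text != NULL && out != NULL);
    size_t n = strlen(text);
    for (size_t k = 1; k <= n; k++)
        out[k - 1] = text + (n - k);  /* the suffix of length k */
    return n;
}
```

For the text "banana$" this yields 7 suffixes, from "$" up to "banana$". Each one could then be inserted as an individual word in step 2.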

If we consider all of the above suffixes as individual words and build a trie,
we get the following.
If we join chains of single nodes, we get the following compressed trie,
which is the Suffix Tree for the given text “banana$”.

Indexes of each suffix string


Ex-2
Construct compressed trie
