
Unit-V: Pattern Matching and Tries

Pattern Matching and Tries: Pattern matching algorithms - Brute force, the Boyer-Moore algorithm, the
Knuth-Morris-Pratt algorithm, Standard Tries, Compressed Tries, Suffix Tries.

Pattern Matching – Introduction


o Pattern Matching is the process of identifying specific sequences of characters or elements within a
larger structure like text, data, or images.
o Think of it like finding a specific word in a sentence or a sequence of symbols or values, within a larger
sequence or text.
o Pattern searching is an important problem in computer science. When we search for a string in a
notepad/word file, a browser, or a database, pattern searching algorithms are used to show the search
results.
A typical problem statement would be -
Given a text txt[0..n-1] and a pattern pat[0..m-1], write a function search(char pat[], char txt[]) that prints all
occurrences of pat[] in txt[]. You may assume that n > m.

Basic Concepts of Pattern Matching


 Pattern: A pattern is a sequence of characters, symbols, or other data that forms a search criterion. In
text processing, a pattern could be a string of characters.
 Text: The text (or string) is the sequence where the pattern is searched for.
 Match: A match occurs if the pattern is found within the text. The goal of pattern matching is to find
all instances where this occurs or to determine whether the pattern exists in the text.
Examples On Pattern Matching:
Input: txt[] = "THIS IS A TEST TEXT"
pat[] = "TEST"
Output: Pattern found at index 10

Input: txt[] = "AABAACAADAABAABA"


pat[] = "AABA"
Output: Pattern found at index 0
Pattern found at index 9
Pattern found at index 12
Different Types of Pattern Matching Algorithms
Brute Force Pattern Matching Algorithm
Checks for the pattern at every possible position in the text. For each position, it compares the pattern
with the corresponding substring in the text. It is effective for small texts or patterns, but inefficient for
large texts.
Knuth-Morris-Pratt (KMP)
Optimizes the naive approach by avoiding redundant comparisons. It pre-processes the pattern to
determine the longest prefix which is also a suffix, allowing the search to skip some comparisons. It is
suitable for applications where the same pattern is searched repeatedly in multiple texts.
Boyer-Moore
Works by comparing the pattern to the text from right to left. It uses two heuristics, the bad character
rule and the good suffix rule, to skip sections of the text, offering potentially sub-linear time
complexity. It is highly efficient for large texts and is considered one of the fastest single-pattern
matching algorithms.
Rabin-Karp
Uses hashing to find pattern occurrences. It hashes the pattern and the text's substrings of the same
length and then compares these hashes. If the hashes match, it checks for a direct match. It is
useful in plagiarism detection or when searching for multiple patterns simultaneously.
Finite Automata
Constructs a state machine based on the pattern. The text is then processed character by character,
transitioning between states of the automaton. It is effective when the same pattern is matched against
many texts, as the automaton needs to be constructed only once.
Aho-Corasick Pattern Matching Algorithm
A more complex algorithm used for finding all occurrences of any of a finite number of patterns within
the text. It constructs a trie of patterns and then a state machine from the trie. Ideal for matching a large
number of patterns simultaneously, like in virus scanning or "grep" utilities.
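Of the algorithms surveyed above, Rabin-Karp is not revisited in detail later in this unit, so a minimal C sketch of its rolling-hash idea is given here. The constants D (alphabet size) and Q (a prime modulus) and the name rabinKarpSearch are illustrative choices, and the sketch assumes the text is at least as long as the pattern, as in the problem statement.

#include <stdio.h>
#include <string.h>

#define D 256          /* size of the input alphabet */
#define Q 101          /* a prime modulus for the rolling hash */

/* Hash the pattern and every text window of the same length; only when
   the two hashes match is a direct character-by-character check done. */
void rabinKarpSearch(const char *txt, const char *pat) {
    int n = strlen(txt), m = strlen(pat);
    int h = 1;                              /* h = D^(m-1) mod Q */
    for (int i = 0; i < m - 1; i++)
        h = (h * D) % Q;

    int p = 0, t = 0;                       /* hash of pattern / current window */
    for (int i = 0; i < m; i++) {
        p = (D * p + pat[i]) % Q;
        t = (D * t + txt[i]) % Q;
    }
    for (int s = 0; s <= n - m; s++) {
        if (p == t) {                       /* hashes match: verify directly */
            int j = 0;
            while (j < m && txt[s + j] == pat[j]) j++;
            if (j == m)
                printf("Pattern found at index %d\n", s);
        }
        if (s < n - m) {                    /* roll the hash to the next window */
            t = (D * (t - txt[s] * h) + txt[s + m]) % Q;
            if (t < 0) t += Q;
        }
    }
}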
BRUTE FORCE PATTERN MATCHING
A brute force algorithm is a straightforward approach to solving a problem. It also refers to a programming
style that does not include any shortcuts to improve performance.
 It is based on trial and error where the programmer tries to merely utilize the computer's fast
processing power to solve a problem, rather than applying some advanced algorithms and techniques
developed with human intelligence.
 It might increase both space and time complexity.
 A simple example of applying brute force would be linearly searching for an element in an array. When
each and every element of an array is compared with the data to be searched, it may be termed a
brute force approach, as it is the most direct and simple way one could think of searching for the given
data in the array.
Brute Force Pattern Matching Algorithm
1. Start at the beginning of the text and slide the pattern window over it.
2. At each position of the text, compare the characters in the pattern with the characters in the text.
3. If a mismatch is found, move the pattern window one position to the right in the text.
4. Repeat steps 2 and 3 until the pattern window reaches the end of the text.
5. If a match is found (all characters in the pattern match the corresponding characters in the text),
record the starting position of the match.
6. Move the pattern window one position to the right in the text and repeat steps 2-5.
7. Continue this process until the pattern window reaches the end of the text.
Brute Force Pattern Matching Pseudo-code
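A minimal C sketch of the brute-force search, following steps 1-7 above; the signature mirrors the search(char pat[], char txt[]) form used in the introduction.

#include <stdio.h>
#include <string.h>

/* Slide the pattern window over the text one position at a time and
   compare it character by character with the text at each position. */
void search(char pat[], char txt[]) {
    int n = strlen(txt);
    int m = strlen(pat);
    for (int s = 0; s <= n - m; s++) {        /* each candidate position */
        int j = 0;
        while (j < m && txt[s + j] == pat[j]) /* compare pattern with text */
            j++;
        if (j == m)                           /* all m characters matched */
            printf("Pattern found at index %d\n", s);
    }
}

int main(void) {
    char txt[] = "AABAACAADAABAABA";
    char pat[] = "AABA";
    search(pat, txt);    /* prints matches at indices 0, 9 and 12 */
    return 0;
}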

Example – 1 On Brute Force Algorithm


Let our text (T) as, “THIS IS A SIMPLE EXAMPLE”
and our pattern (P) as, “SIMPLE”

Red boxes: mismatch. Green boxes: match.


o In the figure above, the red boxes mark letters of the pattern that mismatch the corresponding letters of
the text, and the green boxes mark letters that match.
According to the above:
o In the first row we check whether the first letter of the pattern matches the first letter of the text. It is a
mismatch, because "S" is the first letter of the pattern and "T" is the first letter of the text.
o Then we move the pattern by one position, as shown in the second row, and check the first letter of the
pattern against the second letter of the text. It is also a mismatch.
o Likewise, we continue the checking and moving process. In the fourth row the first letter of the pattern
matches the text. Then we do not move the pattern; instead we test the next letter of the pattern.
We move the pattern by one position only when we find a mismatch.
o Also, in the last row, all the letters of the pattern match consecutive letters of the text.

Example – 2:
Let our text (T) as, “tetththeheehthtehtheththehehtht”
and our pattern (P) as, “the”

Running Time Analysis of Brute Force Pattern Matching Algorithm


Given a pattern of M characters and a text of N characters:

Worst case:
The pattern is compared with each substring of the text of length M.
For example, M = 5.
Total number of comparisons: M(N - M + 1)
Worst-case time complexity: O(MN)

Best case (pattern found):
The pattern is found in the first M positions of the text.
For example, M = 5.
Total number of comparisons: M
Best-case time complexity: O(M)

Best case (pattern not found):
There is always a mismatch on the first character.
For example, M = 5.
Total number of comparisons: N
Best-case time complexity: O(N)
Advantages
1. It is a very simple technique that does not require any preprocessing. Therefore, the total running time is
the same as its matching time.
Disadvantages
1. It is a very inefficient method, because the pattern is shifted by only one position at a time.

Boyer–Moore Pattern Matching


 The Boyer–Moore Pattern Matching algorithm is one of the most efficient string-searching
algorithms that is the standard benchmark for practical pattern matching. It was developed by Robert
Stephen Boyer and J Strother Moore in the year 1977.
 The Boyer-Moore algorithm works by pre-processing the pattern and then scanning the text from right
to left, starting with the rightmost characters. It is based on the principle that if a mismatch is found,
there is no need to match the remaining characters. This backwards approach significantly reduces the
algorithm's time complexity compared to naive string search methods.
 The Boyer-Moore algorithm has two main components:
i. The bad character rule and
ii. The good suffix rule.
 The bad character rule compares the pattern character with the corresponding character in the text. On a
mismatch, the pattern is shifted to the right so that the last occurrence of the mismatched text character in
the pattern aligns with that text character (or the pattern is shifted past it, if the character does not occur
in the pattern).
 The good suffix rule considers the suffix of the pattern that has already matched the text. On a mismatch,
the pattern is shifted to the right until another occurrence of that suffix in the pattern lines up with the
matched portion of the text (or past it, if there is none).
 The Boyer-Moore algorithm is known for its efficiency and is widely used in many applications. It is
considered one of the fastest pattern matching algorithms available.
The shift steps are explained below
1. Text character a mismatch with pattern character b. Character a appears in the Last Occurrence Table
with index 4. Shift pattern so index 4 aligns with the mismatched text character a.
2. Text character a mismatch with pattern character c. Character a appears in the Last Occurrence Table
with index 4. Shifting the pattern so index 4 aligns with the mismatched text character a would shift the
pattern backward. This does not make sense, so all we can do is shift the pattern by 1.
3. Text character a mismatch with pattern character b. Character a appears in the Last Occurrence Table
with index 4. Shift pattern so index 4 aligns with the mismatched text character a.
4. Text character d mismatch with pattern character b. Character d does not appear in the Last Occurrence
Table. We can therefore shift the entire pattern past this mismatch.
5. Text character a mismatch with pattern character b. Character a appears in the Last Occurrence Table
with index 4. Shift pattern so index 4 aligns with the mismatched text character a.
6. All characters match, i.e. we have a full match. Shift the pattern naively by 1.
7. Same scenario as for comparison 7. Align pattern's index 4 with the mismatched text character a.
8. Same scenario as above.
9. Exact same scenario as for comparisons 2, 3, and 4. Shift pattern by 1.
10. Text character b mismatch with pattern character a. We reached the end of the text so we are done.
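A minimal C sketch of the Boyer-Moore search using only the bad character rule and a Last Occurrence Table, as in the shift steps above; the good suffix rule is omitted, and after a full match the pattern is shifted naively by 1. Names such as buildLastOccurrence and boyerMooreSearch are illustrative.

#include <stdio.h>
#include <string.h>

#define ALPHABET 256

/* Last Occurrence Table: rightmost index of each character in the
   pattern, or -1 if the character does not occur in the pattern. */
void buildLastOccurrence(const char *pat, int m, int last[ALPHABET]) {
    for (int c = 0; c < ALPHABET; c++)
        last[c] = -1;
    for (int i = 0; i < m; i++)
        last[(unsigned char)pat[i]] = i;
}

void boyerMooreSearch(const char *txt, const char *pat) {
    int n = strlen(txt), m = strlen(pat);
    int last[ALPHABET];
    buildLastOccurrence(pat, m, last);

    int s = 0;                                   /* current shift of the pattern */
    while (s <= n - m) {
        int j = m - 1;
        while (j >= 0 && pat[j] == txt[s + j])   /* compare right to left */
            j--;
        if (j < 0) {
            printf("Pattern found at index %d\n", s);
            s += 1;                              /* full match: shift naively by 1 */
        } else {
            /* Bad character rule: align the last occurrence of the mismatched
               text character with the mismatch position, but never shift
               backward (shift at least 1). */
            int shift = j - last[(unsigned char)txt[s + j]];
            s += (shift > 0) ? shift : 1;
        }
    }
}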

Knuth-Morris-Pratt (KMP) algorithm


 The KMP algorithm is used to solve the pattern matching problem which is a task of finding all the
occurrences of a given pattern in a text.
 It is very useful when it comes to finding multiple occurrences of a pattern. For instance, if the text is
"aabbaaccaabbaadde" and the pattern is "aabbaa", then the pattern occurs twice in the text, at indices
0 and 8.
 The naive solution to this problem is to compare the pattern with every possible substring of the text,
starting from the leftmost position and moving rightwards.
 This takes O(n*m) time, where 'n' is the length of the text and 'm' is the length of the pattern.
 When we work with long text documents, the brute force and naive approaches may result in redundant
comparisons. To avoid such redundancy, Donald Knuth, James H. Morris, and Vaughan Pratt developed a
linear-time string-matching algorithm in 1970, named the KMP pattern matching algorithm (also referred
to as the Knuth-Morris-Pratt pattern matching algorithm).
 It is used to find the occurrences of a "pattern" within a "text" without checking every single character
in the text, which is a significant improvement over the brute-force approach.
 The KMP algorithm compares the pattern to the text left to right, but shifts the pattern P more
intelligently than the brute-force algorithm.
 When a mismatch occurs, what is the most we can shift the pattern so as to avoid redundant
comparisons? The answer is the largest prefix of P[0..j] that is also a suffix of P[1..j].
Here's a step-by-step explanation of how the KMP algorithm works:
1. Preprocessing (Building the LPS Array)
 The core idea of preprocessing the pattern is to construct an LPS (Longest Prefix Suffix) array.
 This array stores the length of the longest proper prefix which is also a suffix for each sub-pattern of
the pattern.
 This preprocessing helps in determining the next positions in the pattern to be compared, thus
avoiding redundant comparisons.
Procedure for constructing the LPS Table (Longest Prefix Suffix Table):
1. Start by initializing the first element of lps[] to 0, as a single character can't have any proper prefix
or suffix.
2. Maintain two pointers, len and i, where len is the length of the last longest prefix suffix.
Initially, len := 0 and i := 1.
3. Repeat steps 4 to 6, while (i < m)
4. If pattern[len] equals pattern[i], set lps[i] = len + 1, increment both i and len.
5. If they don't match and len is not 0, update len to lps[len - 1].
6. If they don't match and len is 0, set lps[i] = 0 and increment i.
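A possible C realization of steps 1-6 above; computeLPS is an illustrative name.

/* Build the LPS (Longest proper Prefix which is also a Suffix) array. */
void computeLPS(const char *pat, int m, int lps[]) {
    int len = 0;              /* length of the previous longest prefix suffix */
    int i = 1;
    lps[0] = 0;               /* a single character has no proper prefix/suffix */
    while (i < m) {
        if (pat[i] == pat[len]) {
            len++;                /* step 4: extend the prefix suffix */
            lps[i] = len;
            i++;
        } else if (len != 0) {
            len = lps[len - 1];   /* step 5: fall back, do not advance i */
        } else {
            lps[i] = 0;           /* step 6 */
            i++;
        }
    }
}

For example, for the pattern "aabbaa" this produces lps[] = {0, 1, 0, 0, 1, 2}.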
2. Searching
Once the preprocessing is done, the actual search begins:
a. Align the pattern with the beginning of the text.
b. Compare the pattern with the text from left to right.
c. If all characters of the pattern match, a valid occurrence is found.
3. Shifting the Pattern:
 Compare pattern[j] with text[i].
 If they match, increment both i and j.
 If j equals the pattern length, a match is found. Optionally report the match, then set j to lps[j - 1].
 If they don't match and j is not 0, set j to lps[j - 1]. Do not increment i here.
 If they don't match and j is 0, increment i
4. Repeat Comparison
a. Continue comparing the pattern with the text from left to right.
b. Apply the shifting rules whenever a mismatch is encountered.
c. Continue this process until the end of the text is reached or all occurrences of the pattern are found.
5. Termination
The algorithm terminates when either
 The pattern has been shifted past the end of the text, indicating no more matches are possible.
 All occurrences of the pattern have been found.
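Putting the pieces together, a C sketch of the KMP search that applies the shifting rules of step 3, reusing computeLPS from the sketch above; KMPSearch is an illustrative name, and the code assumes a non-empty pattern.

#include <stdio.h>
#include <string.h>

void computeLPS(const char *pat, int m, int lps[]);   /* defined in the sketch above */

void KMPSearch(const char *txt, const char *pat) {
    int n = strlen(txt), m = strlen(pat);
    int lps[m];
    computeLPS(pat, m, lps);

    int i = 0, j = 0;            /* i indexes the text, j indexes the pattern */
    while (i < n) {
        if (txt[i] == pat[j]) {
            i++;
            j++;
            if (j == m) {        /* full match found */
                printf("Pattern found at index %d\n", i - j);
                j = lps[j - 1];  /* continue searching for further matches */
            }
        } else if (j != 0) {
            j = lps[j - 1];      /* mismatch: shift pattern using LPS, keep i */
        } else {
            i++;                 /* mismatch at j = 0: advance in the text */
        }
    }
}

For instance, KMPSearch("aabbaaccaabbaadde", "aabbaa") reports matches at indices 0 and 8, as in the example above.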

TRIES
 A trie (pronounced "try") is a tree-based data structure for storing strings in order to support fast pattern
matching.
 The main application for tries is in information retrieval. Indeed, the name "trie" comes from the word
"retrieval".
 The trie uses the digits in the keys to organize and search the dictionary.
 Although, in practice, we can use any radix to decompose the keys into digits, in our examples, we shall
choose our radixes so that the digits are natural entities such as decimal digits (0, 1, 2, 3, 4, 5, 6, 7, 8, 9)
and letters of the English alphabet (a-z, A-Z).
Trie Definition
A trie, also known as a prefix tree, is a tree-like data structure that stores a dynamic set of strings,
where keys are usually strings.

Characteristics of Tries
 Each node represents a single character of a string.
 The root node represents the empty string.
 Children nodes share the same prefix.
 Paths from the root to a node represent a key.
 Nodes may store additional values and have a flag to mark the end of a key.

Basic Operations of Tries


1. Insertion in Tries
 Start at the root node.
 For each character in the string, move to the corresponding child node.

 If the child node doesn’t exist, create it.

 Mark the end of the string by setting an end-of-word marker at the last node.

2. Search in Tries
 Begin at the root node.

 Traverse the trie following the path defined by each character in the string.

 If the path exists and ends in an end-of-word node, the string is in the trie.

 If the path ends before the string is exhausted or the end-of-word marker is missing, the string isn’t in

the trie.
3. Deletion in Tries
 Similar to search but includes removing the end-of-word marker.

 If a node becomes unnecessary (no children), it is removed.

 Recursively remove nodes up the trie until a node cannot be deleted without affecting other keys.
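A minimal C sketch of a standard trie supporting the insertion and search operations listed above, assuming keys over the lowercase letters a-z; the names TrieNode, createNode, insert and search are illustrative.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdbool.h>

#define ALPHABET_SIZE 26    /* lowercase English letters a-z */

/* One child pointer per possible character plus an end-of-word flag. */
struct TrieNode {
    struct TrieNode *children[ALPHABET_SIZE];
    bool isEndOfWord;
};

struct TrieNode *createNode(void) {
    return calloc(1, sizeof(struct TrieNode));   /* children NULL, flag false */
}

/* Insertion: walk down the trie, creating missing child nodes, and mark
   the last node with the end-of-word flag. */
void insert(struct TrieNode *root, const char *key) {
    struct TrieNode *cur = root;
    for (int i = 0; key[i] != '\0'; i++) {
        int idx = key[i] - 'a';
        if (cur->children[idx] == NULL)
            cur->children[idx] = createNode();
        cur = cur->children[idx];
    }
    cur->isEndOfWord = true;
}

/* Search: follow the path of the key; it is present only if the path
   exists and the final node carries the end-of-word flag. */
bool search(struct TrieNode *root, const char *key) {
    struct TrieNode *cur = root;
    for (int i = 0; key[i] != '\0'; i++) {
        int idx = key[i] - 'a';
        if (cur->children[idx] == NULL)
            return false;
        cur = cur->children[idx];
    }
    return cur->isEndOfWord;
}

int main(void) {
    struct TrieNode *root = createNode();
    const char *words[] = {"cat", "caller", "calling", "bat"};
    for (int i = 0; i < 4; i++)
        insert(root, words[i]);
    printf("%d\n", search(root, "cat"));   /* 1: stored key */
    printf("%d\n", search(root, "call"));  /* 0: only a prefix, not a key */
    return 0;
}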

Different Types of Tries


1. Standard Trie: The most basic form, which often requires space proportional to the size of the
alphabet for each key character.
2. Compressed Trie: Optimizes space by compressing chains of single-child nodes into single edges.
3. Suffix Trie: Contains all the suffixes of a given text, used for pattern matching problems.

Applications of Trie data structure:


 It has a wide variety of applications in data compression, computational biology, the longest prefix
matching algorithm used for routing tables for IP addresses, implementation of dictionaries, pattern
searching, storing/querying XML documents, etc.

Real-time applications of Trie data structure:


1. Browser History: Web browsers keep track of the history of websites visited by the user, so when the
prefix of a previously visited URL is typed in the address bar, the user is given suggestions of websites
to visit.
A trie is used by storing the number of visits to a website as the key value and organizing this history in the
trie data structure.
Browser history suggestions
2. AutoComplete: This is one of the most important applications of the trie data structure. This feature speeds
up interactions between a user and the application and greatly enhances the user experience. The auto-complete
feature is used by web browsers, email clients, search engines, code editors, command-line interpreters (CLI),
and word processors.
A trie provides alphabetical ordering of the data by key. It is used because it is the fastest for auto-complete
suggestions: even in the worst case, it is O(n) times faster (where n is the string length) than the alternative
imperfect hash table algorithm, and it also overcomes the problem of key collisions in imperfect hash tables.

Auto complete suggestions on entering prefix


3. Spell Checkers/Auto-correct: It is a 3-step process that includes:
i. Checking for the word in the data dictionary.
ii. Generating potential suggestions.
iii. Sorting the suggestions with higher priority on top.
Trie stores the data dictionary and makes it easier to build an algorithm for searching the word from the
dictionary and provides the list of valid words for the suggestion.

Auto correct
4. Longest Prefix Matching Algorithm (Maximum Prefix Length Match): This algorithm is used in
networking by routing devices in IP networking. Optimization of network routes requires contiguous
masking, which bounds the lookup time to O(n), where n is the length of the IP address in bits.
To speed up the lookup process, multiple-bit trie schemes were developed that perform lookups of
multiple bits at a time.

IP routing
Advantages of Trie data structure:
 Trie allows us to insert and find strings in O(L) time, where L is the length of a single word. It is faster
as compared to both hash tables and binary search trees.
 It provides alphabetical filtering of entries by the key of the node and hence makes it easier to print all
words in alphabetical order.
 Trie takes less space when compared to a BST because the keys are not explicitly saved; instead, each
key requires just an amortized fixed amount of space to be stored.
 Prefix search/longest prefix matching can be efficiently done with the help of the trie data structure.
 Since a trie doesn't need any hash function for its implementation, it is generally faster than a hash
table for small keys like integers and pointers.
 Tries support ordered iteration, whereas iteration over a hash table results in a pseudorandom order given
by the hash function, which is usually more cumbersome.
 Deletion is also a straightforward algorithm with O(L) as its time complexity, where L is the length of
the word to be deleted.
Disadvantages of Trie data structure:
 The main disadvantage of tries is that they need a lot of memory for storing the strings. Each node has
many node pointers (equal to the number of characters of the alphabet); if space is a concern, a Ternary
Search Tree can be preferred for dictionary implementations.
 In a Ternary Search Tree, the time complexity of the search operation is O(h), where h is the height of the
tree.
 Ternary Search Trees also support other operations supported by tries, like prefix search, alphabetical-order
printing, and nearest-neighbor search.
 The final conclusion regarding the trie data structure is that it is faster but requires a huge amount of memory
for storing the strings.
Why Trie?
1. With a trie, we can insert and find strings in O(L) time, where L represents the length of a single word. This
is obviously faster than a BST. It is also faster than hashing because of the way it is implemented: we
do not need to compute any hash function, and no collision handling is required (as we do in open
addressing and separate chaining).
2. Another advantage of Trie is, we can easily print all words in alphabetical order which is not easily
possible with hashing.
3. We can efficiently do prefix search (or auto-complete) with Trie.
Space and Time Complexity of Tries
 Space: A standard trie can require more space than a hash table because of the storage of nodes and
pointers, particularly for sparse datasets.
 Search Time: Tries provide O(m) lookup time, where m is the length of the string. This is independent
of the number of keys in the trie.

EXAMPLE

Trie (Insert and Search)


 Trie is an efficient information retrieval data structure.
 Using Trie, search complexities can be brought to an optimal limit (key length).
 Given multiple strings, the task is to insert the string in a Trie.
Examples:
Example 1: str = {"cat", "there", "caller", "their", "calling", “bat”}

Example 2: str = {"candy", "cat", "caller", "calling"}

Approach:
 An efficient approach is to treat every character of the input key as an individual trie node and insert it
into the trie.
 Note that the children are an array of pointers (or references) to next level trie nodes.
 The key character acts as an index into the array of children.
 If the input key is new or an extension of the existing key, we need to construct non-existing nodes of the
key, and mark end of the word for the last node.
 If the input key is a prefix of the existing key in Trie, we simply mark the last node of the key as the end
of a word.
 The key length determines Trie depth.
Trie deletion
Here is an algorithm for deleting a key from a trie. During the delete operation, we delete the key in a bottom-up
manner using recursion.
The following are the possible cases when deleting a key from a trie:
1. The key may not be there in the trie. The delete operation should not modify the trie.
2. The key is present as a unique key (no part of the key contains another key as a prefix, nor is the key itself a
prefix of another key in the trie). Delete all of its nodes.
3. The key is a prefix of another, longer key in the trie. Unmark the leaf node.
4. The key is present in the trie and has at least one other key as a prefix. Delete nodes from the end of the key
until the first leaf node of the longest prefix key.
Time Complexity: The time complexity of the deletion operation is O(n), where n is the key length.
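A recursive C sketch of the bottom-up deletion covering the four cases above, building on the TrieNode structure and helpers from the earlier standard-trie sketch; isEmpty and deleteKey are illustrative names.

/* True if the node has no children at all. */
bool isEmpty(struct TrieNode *node) {
    for (int i = 0; i < ALPHABET_SIZE; i++)
        if (node->children[i] != NULL)
            return false;
    return true;
}

/* Deletes key from the trie rooted at root (bottom-up, using recursion)
   and returns the possibly changed subtree root. */
struct TrieNode *deleteKey(struct TrieNode *root, const char *key, int depth) {
    if (root == NULL)
        return NULL;                       /* case 1: key not present */
    if (key[depth] == '\0') {              /* reached the last node of the key */
        root->isEndOfWord = false;         /* case 3: unmark the leaf node */
        if (isEmpty(root)) {               /* case 2: no other key uses this node */
            free(root);
            return NULL;
        }
        return root;
    }
    int idx = key[depth] - 'a';
    root->children[idx] = deleteKey(root->children[idx], key, depth + 1);
    /* case 4: remove this node too if it became childless and does not
       mark the end of another key */
    if (isEmpty(root) && !root->isEndOfWord) {
        free(root);
        return NULL;
    }
    return root;
}

Usage: root = deleteKey(root, "bat", 0); the root pointer is reassigned because the root node itself may be freed if the trie becomes empty.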

TYPES OF TRIES
Tries are classified into three categories:
1. Standard Tries
2. Compressed Tries
3. Suffix Tries

STANDARD TRIES
A standard trie has the following properties:
 It is an ordered, tree-like data structure.
 Each node (except the root node) in a standard trie is labelled with a character.
 The children of a node are in alphabetical order.
 Each node or branch represents a possible character of the keys or words.
 Each node or branch may have multiple branches.
 The last node of every key or word is marked as the end of the word.
 The paths from the root to the external nodes yield the strings of S.
Below is the illustration of the Standard Trie
Standard Trie Insertion
Strings = {a, an, and, any}

Example of Standard Trie


Standard trie for the following strings:
S = {bear, bell, bid, bull, buy, sell, stock, stop}
Handling keys (strings)
When a key is a prefix of another key, how can we know that "an" is a word?
Example: an, and. The end-of-word marker on the last node of "an" tells us that "an" is itself a word and
not just a prefix of "and".

Standard Trie Searching


A search hit occurs when the node reached by the search carries a $ (end-of-word) symbol.

Standard Trie Deletion


To perform deletion, the following cases exist:
1. The word is not found:
return false
2. The word exists as a standalone word:
i. it is part of another word (shares nodes with another key)
Example:
ii. it is not part of any other word
Example:

3. Word exists as a prefix of another word.

COMPRESSED TRIE
A compressed trie has the following properties:
1. A compressed trie is an advanced version of the standard trie.
2. Each node (except the leaf nodes) has at least 2 children.
3. It is used to achieve space optimization.
4. To derive a compressed trie from a standard trie, chains of redundant nodes are compressed.
5. It involves grouping, re-grouping and un-grouping of keys of characters.
6. While performing the insertion operation, it may be required to un-group already grouped
characters.
7. While performing the deletion operation, it may be required to re-group already grouped characters.

Compressed trie is constructed from standard trie


Storage of Compressed Trie
 A compressed trie can be stored in O(s) space, where s = |S|, by using O(1)-space index ranges at the nodes.
 In the representation below, each node is represented by a triple of values (i, j, k), where:
i indicates the index of the string in S,
j is the starting index of the characters of string i, and
k is the ending index of the characters of string i.
 Example: in the given diagram, the node (4, 2, 3) holds the characters "ll", which belong to S[4], so i = 4;
the index of the first 'l' character in S[4] is 2, so j = 2; and the ending index is 3, so k = 3.
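The index ranges described above might be declared in C as follows; this is only a sketch of the node layout, and the field names i, j, k simply mirror the triple used in the diagram.

#include <stdbool.h>

#define ALPHABET_SIZE 26    /* as in the standard-trie sketch earlier */

/* A compressed-trie node stores an O(1)-space index range (i, j, k)
   into the string collection S instead of an explicit substring. */
struct CompressedTrieNode {
    int i;   /* index of the string in S                    */
    int j;   /* starting index of the substring within S[i] */
    int k;   /* ending index of the substring within S[i]   */
    struct CompressedTrieNode *children[ALPHABET_SIZE];
    bool isEndOfWord;
};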
Suffix Tries
 A suffix trie (also called PAT tree or, in an earlier form, position tree) is a compressed trie containing all
the suffixes of the given text as their keys and positions in the text as their values.
 Suffix tries allow particularly fast implementations of many important string operations.
 A Suffix Tree for a given text is a compressed trie for all suffixes of the given text.
A suffix trie has the following properties:
1. A suffix trie is a compressed trie for all the suffixes of the text.
2. Suffix tries are a space-efficient data structure to store a string that allows many kinds of queries to be
answered quickly.
Building a Suffix Tree for a given text
 As discussed above, Suffix Tree is compressed trie of all suffixes, so following are very abstract steps
to build a suffix tree from given text.
1. Generate all suffixes of given text.
2. Consider all suffixes as individual words and build a compressed trie.
 Let us consider an example text “banana\0” where ‘\0’ is string termination character. Following are
all suffixes of “banana\0”
banana\0
anana\0
nana\0
ana\0
na\0
a\0
\0
 If we consider all of the above suffixes as individual words and build a trie, we get following.

 If we join chains of single nodes, we get the following compressed trie, which is the Suffix Tree for
given text “banana\0”.

 Please note that the above steps are only a manual way to create a Suffix Tree; the actual construction
algorithm and implementation are discussed separately.
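Following the two abstract steps above, here is a C sketch that builds an uncompressed suffix trie by inserting every suffix of the text into a standard trie; it reuses TrieNode and createNode from the standard-trie sketch earlier, assumes the text uses only lowercase a-z, and for simplicity omits the '\0'/'$' terminator and the position values. buildSuffixTrie is an illustrative name.

#include <string.h>

/* Steps 1 and 2 combined: insert every suffix text[s..n-1] into a
   standard (uncompressed) trie. Joining chains of single-child nodes
   would then give the compressed trie, i.e. the Suffix Tree. */
void buildSuffixTrie(struct TrieNode *root, const char *text) {
    int n = strlen(text);
    for (int s = 0; s < n; s++) {
        struct TrieNode *cur = root;
        for (int i = s; i < n; i++) {
            int idx = text[i] - 'a';
            if (cur->children[idx] == NULL)
                cur->children[idx] = createNode();
            cur = cur->children[idx];
        }
        cur->isEndOfWord = true;    /* marks where this suffix ends */
    }
}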

Searching a pattern in the built suffix tree


We have discussed above how to build a Suffix Tree which is needed as a preprocessing step in pattern
searching.
Following are abstract steps to search a pattern in the built Suffix Tree.
1) Starting from the first character of the pattern and the root of the Suffix Tree, do the following for every
character:
a) For the current character of the pattern, if there is an edge from the current node of the suffix tree, follow
the edge.
b) If there is no edge, print "pattern doesn't exist in text" and return.
2) If all characters of the pattern have been processed, i.e., there is a path from the root for the characters of
the given pattern, then print "Pattern found".
Let us consider the example pattern "nan" to see the searching process. The following diagram shows the
path followed for searching "nan" or "nana".
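A C sketch of the search steps above over the trie built by buildSuffixTrie; a pattern occurs in the text exactly when its characters trace a path from the root, so no end-of-word check is needed. occursIn is an illustrative name.

/* Follow the pattern's characters edge by edge from the root. */
bool occursIn(struct TrieNode *root, const char *pat) {
    struct TrieNode *cur = root;
    for (int i = 0; pat[i] != '\0'; i++) {
        int idx = pat[i] - 'a';
        if (cur->children[idx] == NULL)
            return false;            /* "pattern doesn't exist in text" */
        cur = cur->children[idx];
    }
    return true;                     /* "Pattern found" */
}

For example, after buildSuffixTrie(root, "banana"), occursIn(root, "nan") returns true and occursIn(root, "nab") returns false.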

Example
Let us consider an example text “soon$”

After ordering the children alphabetically, the trie looks like:


Applications of Suffix Tree
A suffix tree can be used for a wide range of problems. The following are some famous problems where suffix
trees provide an optimal time complexity solution.
 Pattern Searching
 Finding the longest repeated substring
 Finding the longest common substring
 Finding the longest palindrome in a string
Advantages of suffix tries
1. Insertion is faster compared to a hash table.
2. Lookup is faster than a hash table implementation.
3. There are no collisions of different keys in tries.

Difference between Standard trie, Compact trie, and Suffix trie

S. No | Standard Trie | Compressed Trie | Suffix Trie
1 | It is the most basic form of trie. | It is an advanced form of the standard trie. | It is a completely different trie type, with strings stored in compressed form.
2 | Each node with its children represents alphabets. | Redundant nodes are compressed. | It is for inserting suffixes in a node.
3 | The last alphabet is represented by children. | The last alphabet is represented by children. | The $ symbol represents the end of the node path.
4 | It supports operations like insertion, deletion, and searching. | It supports operations like insertion and deletion, with grouping and ungrouping of already formed groups. | It supports operations for suffix matching and searching.
5 | A node can have one or more children, or none. | Each node has at least 2 children. | Each node has a suffix of words.
6 | It is a general-purpose trie for storing the individual characters of a word. | It helps in optimizing the space by merging the redundant nodes. | It is a special trie type that helps in the retrieval of suffixes.
