Unit-V DS Pattern Matching and Tries
Pattern Matching and Tries: Pattern matching algorithms (Brute force, the Boyer-Moore algorithm, the
Knuth-Morris-Pratt algorithm), Standard Tries, Compressed Tries, Suffix Tries.
Example 2:
Let our text (T) be “tetththeheehthtehtheththehehtht”
and our pattern (P) be “the”.
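The brute-force method named in the unit outline can be tried on this example. The following is a minimal sketch, assuming the example belongs to the brute-force discussion; the function name brute_force_match is illustrative and not taken from the notes. It slides the pattern over the text one position at a time and compares character by character.

```python
def brute_force_match(text, pattern):
    """Return the 0-based start index of every occurrence of pattern in text."""
    n, m = len(text), len(pattern)
    matches = []
    for start in range(n - m + 1):          # try each alignment of the pattern
        j = 0
        while j < m and text[start + j] == pattern[j]:
            j += 1                          # extend the match one character at a time
        if j == m:                          # all m characters matched
            matches.append(start)
    return matches


T = "tetththeheehthtehtheththehehtht"
P = "the"
print(brute_force_match(T, P))  # [5, 17, 22]
```

With 0-based indexing, the pattern “the” occurs three times in T. In the worst case each of the n - m + 1 alignments costs up to m comparisons, giving O(nm) time.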
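Best case
If the pattern is found: the pattern matches at the first M positions of the text, so only M character
comparisons are needed, where M is the length of the pattern.
For example, M=5.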
TRIES
A trie (pronounced "try") is a tree-based data structure for storing strings in order to support fast pattern
matching.
The main application for tries is in information retrieval. Indeed, the name "trie" comes from the word
"retrieval".
The trie uses the digits in the keys to organize and search the dictionary.
Although, in practice, we can use any radix to decompose the keys into digits, in our examples, we shall
choose our radixes so that the digits are natural entities such as decimal digits (0, 1, 2, 3, 4, 5, 6, 7, 8, 9)
and letters of the English alphabet (a-z, A-Z).
Definition of a Trie
A trie, also known as a prefix tree, is a tree-like data structure that stores a dynamic set of keys,
where the keys are usually strings.
Characteristics of Tries
Each node represents a single character of a string.
The root node represents the empty string.
All descendants of a node share the prefix represented by that node.
The path from the root to a node represents a key or a prefix of a key.
Nodes may store additional values and have a flag to mark the end of a key.
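1. Insertion in Tries
Begin at the root node.
For each character in the string, follow the matching child node, creating it if it does not exist.
Mark the end of the string by setting an end-of-word marker at the last node.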
2. Search in Tries
Begin at the root node.
Traverse the trie following the path defined by each character in the string.
If the path exists and ends in an end-of-word node, the string is in the trie.
If the path ends before the string is exhausted or the end-of-word marker is missing, the string isn’t in
the trie.
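The search steps above, together with the insertion they depend on, can be sketched as follows. This is a minimal illustration, assuming each node keeps its children in a dictionary; the class and attribute names (TrieNode, Trie, is_end) are choices made for this sketch.

```python
class TrieNode:
    def __init__(self):
        self.children = {}     # maps a character to the child TrieNode
        self.is_end = False    # end-of-word marker


class Trie:
    def __init__(self):
        self.root = TrieNode()  # the root represents the empty string

    def insert(self, word):
        node = self.root
        for ch in word:
            if ch not in node.children:      # create the missing node
                node.children[ch] = TrieNode()
            node = node.children[ch]
        node.is_end = True                   # mark the end of the string

    def search(self, word):
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:                 # path ends before the string does
                return False
        return node.is_end                   # path exists; is it a stored key?


trie = Trie()
for w in ["the", "then", "there"]:
    trie.insert(w)
print(trie.search("the"))    # True
print(trie.search("th"))     # False: path exists but no end-of-word marker
print(trie.search("these"))  # False: path ends before the string is exhausted
```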
3. Deletion in Tries
Locate the key as in search, then remove its end-of-word marker.
Recursively remove nodes going back up the trie, stopping at the first node that cannot be deleted
without affecting other keys.
Auto correct
4. Longest Prefix Matching Algorithm (Maximum Prefix Length Match): Routing devices in IP networks
use this algorithm to choose the most specific route for a destination address. Because route prefixes
use contiguous masks, the lookup time is bounded by O(n), where n is the length of the IP address in
bits.
To speed up the lookup process, multibit trie schemes were developed that examine several bits
per step instead of one.
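Longest prefix matching can be illustrated with a binary trie that branches on one address bit per level. The sketch below is a simplified illustration, not a description of any real router: addresses and prefixes are written as strings of '0' and '1', and the route labels "A" and "B" are hypothetical.

```python
class BitNode:
    def __init__(self):
        self.children = [None, None]  # child for bit 0 and child for bit 1
        self.next_hop = None          # set when a route prefix ends at this node


def add_route(root, prefix_bits, next_hop):
    node = root
    for bit in prefix_bits:
        i = int(bit)
        if node.children[i] is None:
            node.children[i] = BitNode()
        node = node.children[i]
    node.next_hop = next_hop


def longest_prefix_match(root, address_bits):
    """Walk the address bit by bit, remembering the last route seen."""
    node, best = root, root.next_hop
    for bit in address_bits:
        node = node.children[int(bit)]
        if node is None:              # no longer prefix exists
            break
        if node.next_hop is not None:
            best = node.next_hop      # a longer matching prefix was found
    return best


root = BitNode()
add_route(root, "10", "A")      # hypothetical route for prefix 10*
add_route(root, "1011", "B")    # hypothetical, more specific route 1011*
print(longest_prefix_match(root, "10110000"))  # B
print(longest_prefix_match(root, "10010000"))  # A
print(longest_prefix_match(root, "01000000"))  # None
```

The walk visits at most one node per address bit, which is where the O(n) bound comes from.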
IP routing
Advantages of Trie data structure:
A trie lets us insert and find strings in O(L) time, where L is the length of a single word. This does not
depend on the number of stored keys, unlike a binary search tree, and there are no hash collisions to
slow down the worst case, unlike a hash table.
It keeps entries ordered by key, which makes it easy to print all words in alphabetical order.
A trie can take less space than a BST when many keys share prefixes, because a shared prefix is stored
only once instead of once per key.
Prefix search and longest prefix matching can be done efficiently with the help of a trie.
Since a trie does not need a hash function, it can be faster than a hash table for small keys such as
integers and pointers.
Tries support ordered iteration, whereas iteration over a hash table yields a pseudorandom order
determined by the hash function, which is usually less useful.
Deletion is also a straightforward algorithm with O(L) time complexity, where L is the length of
the word to be deleted.
Disadvantages of Trie data structure:
The main disadvantage of tries is that they need a lot of memory for storing the strings. Each node
holds many child pointers (one per character of the alphabet). If space is a concern,
a Ternary Search Tree can be preferred for dictionary implementations.
In a Ternary Search Tree, the time complexity of the search operation is O(h), where h is the height of the
tree.
Ternary Search Trees also support other operations supported by tries, such as prefix search, alphabetical
order printing, and nearest neighbor search.
In conclusion, tries are fast but can require a large amount of memory for
storing the strings.
Why Trie?
1. With a trie, we can insert and find strings in O(L) time, where L represents the length of a single word.
A balanced BST needs O(L log n) time for n keys, so the trie is faster. It can also be faster than hashing,
because we do not need to compute any hash function and no collision handling is required (as in open
addressing and separate chaining).
2. Another advantage of a trie is that we can easily print all words in alphabetical order, which is not easily
possible with hashing.
3. We can efficiently do prefix search (or auto-complete) with a trie.
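Points 2 and 3 can be shown together: walking to the node for a prefix and then visiting its children in sorted order lists every completion alphabetically. This is a minimal sketch using nested dictionaries, where the key "$" is an assumed end-of-word marker (so stored words must not contain "$").

```python
def build_trie(words):
    """Nested-dict trie; the key "$" marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True
    return root


def words_with_prefix(root, prefix):
    """Return all stored words starting with prefix, in alphabetical order."""
    node = root
    for ch in prefix:                 # walk down to the node for the prefix
        if ch not in node:
            return []
        node = node[ch]
    results = []

    def collect(n, path):
        if "$" in n:                  # a stored word ends here
            results.append(prefix + path)
        for ch in sorted(k for k in n if k != "$"):
            collect(n[ch], path + ch)

    collect(node, "")
    return results


trie = build_trie(["then", "the", "tea", "there", "to"])
print(words_with_prefix(trie, "the"))  # ['the', 'then', 'there']
print(words_with_prefix(trie, ""))     # ['tea', 'the', 'then', 'there', 'to']
```

Passing the empty prefix prints the whole dictionary in alphabetical order, regardless of insertion order.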
Space and Time Complexity of Tries
Space: A standard trie can require more space than a hash table because of the storage of nodes and
pointers, particularly for sparse datasets.
Search Time: Tries provide O(m) lookup time, where m is the length of the string. This is independent
of the number of keys in the trie.
EXAMPLE
Approach:
An efficient approach is to treat every character of the input key as an individual trie node and insert it
into the trie.
Note that the children are an array of pointers (or references) to next level trie nodes.
The key character acts as an index into the array of children.
If the input key is new or an extension of an existing key, we need to construct the nodes of the key
that do not yet exist and mark the last node as the end of the word.
If the input key is a prefix of an existing key in the trie, we simply mark the last node of the key as the end
of a word.
The length of the longest key determines the trie depth.
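The approach above, with children stored in an array indexed by the key character, can be sketched as follows. It assumes keys contain only the lowercase letters a-z; ALPHABET_SIZE, insert and count_nodes are names chosen for this illustration.

```python
ALPHABET_SIZE = 26  # assumption: keys use only lowercase letters a-z


class TrieNode:
    def __init__(self):
        self.children = [None] * ALPHABET_SIZE  # one slot per letter
        self.is_end_of_word = False


def insert(root, key):
    node = root
    for ch in key:
        index = ord(ch) - ord("a")              # the character acts as an index
        if node.children[index] is None:        # construct the missing node
            node.children[index] = TrieNode()
        node = node.children[index]
    node.is_end_of_word = True                  # mark the last node


def count_nodes(node):
    """Count this node plus all nodes below it."""
    return 1 + sum(count_nodes(c) for c in node.children if c is not None)


root = TrieNode()
insert(root, "there")
print(count_nodes(root))  # 6: the root plus one node per character
insert(root, "the")       # prefix of an existing key: only a marker is set
print(count_nodes(root))  # 6
insert(root, "their")     # shares "the", adds nodes for "i" and "r"
print(count_nodes(root))  # 8
```

The fixed-size child array gives constant-time child lookup but is the source of the memory cost discussed under the disadvantages of tries.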
Trie deletion
The following algorithm deletes a key from a trie. During the delete operation we delete the key in a
bottom-up manner using recursion.
The following conditions are possible when deleting a key from a trie:
1. The key may not be in the trie. The delete operation should not modify the trie.
2. The key is present as a unique key (no part of the key contains another key as a prefix, nor is the key
itself a prefix of another key in the trie). Delete all of its nodes.
3. The key is a prefix of another, longer key in the trie. Remove the end-of-word marker from its last node.
4. The key is present in the trie and has at least one other key as a prefix. Delete nodes from the end of
the key back to the last node of the longest prefix key.
Time Complexity: The time complexity of the deletion operation is O(n), where n is the key length.
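The four cases can be handled by one recursive routine that unmarks the key and then prunes nodes on the way back up. This is a minimal sketch with dictionary-based nodes; the helper names (contains, delete, prune) are chosen for this illustration.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_end = False


def insert(root, key):
    node = root
    for ch in key:
        if ch not in node.children:
            node.children[ch] = TrieNode()
        node = node.children[ch]
    node.is_end = True


def contains(root, key):
    node = root
    for ch in key:
        node = node.children.get(ch)
        if node is None:
            return False
    return node.is_end


def delete(root, key):
    """Delete key bottom-up; return False if the key was not present."""
    if not contains(root, key):          # case 1: trie is left unchanged
        return False

    def prune(node, depth):
        if depth == len(key):
            node.is_end = False          # unmark the end of the key
        else:
            ch = key[depth]
            if prune(node.children[ch], depth + 1):
                del node.children[ch]    # child is on no other key's path
        # A node is removable if it ends no key and has no children left.
        return not node.is_end and not node.children

    prune(root, 0)
    return True


root = TrieNode()
for k in ["an", "and", "bye"]:
    insert(root, k)
print(delete(root, "xyz"))   # False (case 1)
print(delete(root, "bye"))   # True  (case 2: all of its nodes are removed)
print(delete(root, "an"))    # True  (case 3: only the marker is removed)
print(contains(root, "and")) # True
```

Case 4 falls out of the same rule: pruning stops at the first node that still ends another key or has other children.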
TYPES OF TRIES
Tries are classified into three categories:
1. Standard Tries
2. Compressed Tries
3. Suffix Tries
STANDARD TRIES
A standard trie has the following properties:
It is an ordered tree-like data structure.
Each node (except the root node) in a standard trie is labelled with a character.
The children of a node are in alphabetical order.
Each node or branch represents a possible character of keys or words.
Each node may have multiple branches.
The last node of every key or word is marked as the end of the word.
The path from the root to an external node yields a string of S, the set of strings stored in the trie.
[Figure: Standard trie insertion for the strings {a, an, and, any}]
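The standard trie for the strings {a, an, and, any} can also be rendered as text. The following is a small sketch, assuming a nested-dictionary trie with "$" as the end-of-word marker; the render function and the "*" notation are conventions of this sketch only.

```python
def build_trie(words):
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True              # end-of-word marker
    return root


def render(node, depth=0):
    """Return one line per node, indented by depth; '*' marks the end of a word."""
    lines = []
    for ch in sorted(k for k in node if k != "$"):   # children in alphabetical order
        child = node[ch]
        mark = "*" if "$" in child else ""
        lines.append("  " * depth + ch + mark)
        lines.extend(render(child, depth + 1))
    return lines


print("\n".join(render(build_trie(["a", "an", "and", "any"]))))
# a*
#   n*
#     d*
#     y*
```

All four strings share the prefix "a", so the trie needs only four labelled nodes below the root.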
COMPRESSED TRIE
A compressed trie has the following properties:
1. A compressed trie is an advanced version of the standard trie.
2. Each node (except the leaf nodes) has at least 2 children.
3. It is used to achieve space optimization.
4. To derive a compressed trie from a standard trie, chains of redundant (single-child) nodes are
compressed into single nodes.
5. It involves grouping, re-grouping and un-grouping of the characters of keys.
6. While performing the insertion operation, it may be required to un-group already grouped
characters.
7. While performing the deletion operation, it may be required to re-group characters.
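Property 4 can be sketched directly: build a standard trie, then merge every chain of single-child nodes into one edge labelled with a string. This is an illustrative sketch using nested dictionaries with an assumed "$" end-of-word marker; the word list is a made-up example.

```python
END = "$"  # assumed end-of-word marker; words must not contain it


def build_trie(words):
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def compress(node):
    """Merge each chain of single-child nodes into one string-labelled edge."""
    compressed = {}
    for label, child in node.items():
        if label == END:
            compressed[END] = True
            continue
        # Follow the chain while the child has exactly one outgoing edge
        # and does not itself end a word.
        while len(child) == 1 and END not in child:
            (next_label, next_child), = child.items()
            label += next_label
            child = next_child
        compressed[label] = compress(child)
    return compressed


words = ["bear", "bell", "bid", "sell", "stop"]
print(compress(build_trie(words)))
```

For this word list, the edges below the root become "b" (branching into "e" and "id", with "e" branching into "ar" and "ll") and "s" (branching into "ell" and "top").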
If we join chains of single-child nodes, we get a compressed trie, which is the Suffix Tree for the
given text “banana\0”.
Please note that the above steps only describe how to create a Suffix Tree by hand.
Example
Let us consider an example text “soon$”
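A suffix trie for this text can be built by inserting every suffix into a standard trie. The following is a minimal sketch using nested dictionaries; the function names are chosen for this illustration, and this simple construction takes O(n^2) time and space for a text of length n.

```python
def build_suffix_trie(text):
    """Insert every suffix of text into a nested-dict trie."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:           # suffix beginning at position start
            node = node.setdefault(ch, {})
    return root


def has_substring(root, pattern):
    """A pattern occurs in the text exactly when it is a prefix of some suffix."""
    node = root
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


trie = build_suffix_trie("soon$")        # suffixes: soon$, oon$, on$, n$, $
print(sorted(trie))                      # ['$', 'n', 'o', 's']
print(has_substring(trie, "oon"))        # True
print(has_substring(trie, "no"))         # False
```

The terminator $ ensures that no suffix is a prefix of another, so every suffix ends at its own leaf.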
No. | Standard Trie | Compressed Trie | Suffix Trie
1 | It is the most basic form of trie. | It is an advanced form of the standard trie. | It is a completely different trie type with strings stored in compressed form.
2 | Each node with its children represents alphabets. | Redundant nodes are compressed. | It is for inserting suffixes in a node.
3 | Last alphabet is represented by children. | Last alphabet is represented by children. | $ symbol represents the end of the node path.
4 | It supports operations like insertion, deletion, and searching. | It supports operations like insertion and deletion with grouping and ungrouping of already formed groups. | It supports operations for suffix matching and searching.
5 | A node can have one or more or no children. | Each node has at least 2 children. | Each node has a suffix of words.
6 | It is a general purpose trie for storing individual characters of a word. | It helps in optimizing the space while merging the redundant nodes. | It is a special trie type that helps in retrieval of suffix/