Harman Design of String Datastructures
SUFFIX TRIE -> SUFFIX TREE -> SUFFIX ARRAY -> FM INDEX
################################################
class LongString {
    public LongString(String s) {
        // todo
    }

    /**
     * Returns the character at position 'i'.
     */
    public char charAt(int i) {
        // todo
    }

    /**
     * Deletes the specified character substring.
     * The substring begins at the specified 'start' and extends to the
     * character at index 'end - 1' or to the end of the sequence
     * if no such character exists.
     * If 'start' is equal to 'end', no changes are made.
     *
     * @param start The beginning index, inclusive.
     * @param end The ending index, exclusive.
     * @throws StringIndexOutOfBoundsException if 'start' is negative,
     *         greater than length, or greater than 'end'.
     */
    public void delete(int start, int end) {
        // todo
    }
}
ok so...
NAIVE:
import math

class Node:
    def __init__(self, data, lc, rc):
        self.data = data  # character payload (leaves only)
        self.lc = lc      # count of live characters in the left subtree
        self.rc = rc      # count of live characters in the right subtree
    def __repr__(self):
        return f'{self.data}_{self.lc}_{self.rc}'

class LongString:
    def __init__(self, s):
        self.items, self.tree_height = self.build_tree(s)
    def __repr__(self):
        return f'{self.items}'
    def build_tree(self, s):
        height_tree = int(math.ceil(math.log2(len(s))))
        number_items = (2**height_tree) + (2**height_tree) - 1
        items = [None] * number_items
        def build(node_i, lo, hi):
            if lo == hi:  # leaf: one character (None past the end of s)
                data = s[lo] if lo < len(s) else None
                items[node_i] = Node(data, 0, 0)
                return 1 if data is not None else 0
            mid = (lo + hi) // 2
            lc = build(node_i*2 + 1, lo, mid)
            rc = build(node_i*2 + 2, mid + 1, hi)
            items[node_i] = Node(None, lc, rc)
            return lc + rc
        build(0, 0, 2**height_tree - 1)
        return items, height_tree
    def charAt(self, i):
        node_i, start_idx = 0, 0
        for _ in range(self.tree_height + 1):
            node = self.items[node_i]
            right_start_idx = node.lc + start_idx
            if (node.lc == 0 and node.rc == 0 and right_start_idx == i):
                return node.data
            if (i < right_start_idx):
                node_i = (node_i*2) + 1
            else:
                node_i = (node_i*2) + 2
                start_idx = right_start_idx
        return None
    def delete(self, d_start, d_end):
        d_end = d_end - 1  # make the exclusive end inclusive
        def traverse(node_i, start_idx):
            node = self.items[node_i]
            if node.lc == 0 and node.rc == 0:  # leaf at live index start_idx
                if d_start <= start_idx <= d_end:
                    node.data = None
                return
            right_start = node.lc + start_idx
            start = start_idx
            end = right_start + node.rc
            left_removed = max(0, min(d_end, right_start - 1) - max(d_start, start) + 1)
            right_removed = max(0, min(d_end, end - 1) - max(d_start, right_start) + 1)
            if left_removed:
                traverse(node_i*2 + 1, start)
            if right_removed:
                traverse(node_i*2 + 2, right_start)
            node.lc -= left_removed
            node.rc -= right_removed
        traverse(0, 0)
a = LongString("ABCD")
print(a)
for i in range(8):
print(i, a.charAt(i))
a.delete(1, 3)
print(a)
for i in range(8):
print(i, a.charAt(i))
a = LongString("ABCDEF")
print(a)
for i in range(8):
print(i, a.charAt(i))
a.delete(1, 3)
for i in range(8):
print(i, a.charAt(i))
a.delete(1, 3)
print(a)
for i in range(8):
print(i, a.charAt(i))
Description
A rope is a binary tree where each leaf (end node) holds a string and a
length (also known as a "weight"), and each node further up the tree holds the
sum of the lengths of all the leaves in its left subtree. A node with two
children thus divides the whole string into two parts: the left subtree stores
the first part of the string, the right subtree stores the second part of the
string, and a node's weight is the length of the first part.
For rope operations, the strings stored in nodes are assumed to be constant
immutable objects in the typical nondestructive case, allowing for some
copy-on-write behavior.
Leaf nodes are usually implemented as basic fixed-length strings with a
reference count attached for deallocation when no longer needed, although other
garbage collection methods can be used as well.
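A tiny Python sketch of that node shape (the names RopeLeaf/RopeConcat are mine,
not from the article): leaves hold string fragments, internal nodes hold the
total length of their left subtree as the weight.

class RopeLeaf:
    def __init__(self, s):
        self.s = s
        self.weight = len(s)   # a leaf's weight is its own length

class RopeConcat:
    def __init__(self, left, right):
        self.left = left
        self.right = right
        self.weight = rope_length(left)   # length of the first (left) part only

def rope_length(node):
    if isinstance(node, RopeLeaf):
        return len(node.s)
    return node.weight + rope_length(node.right)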
Insert
Definition: Insert(i, S'): insert the string S' beginning at position i in
the string s, to form a new string C1, ..., Ci, S', Ci+1, ..., Cm.
Time complexity: O(log N).
This operation can be done by a Split() and two Concat() operations. The
cost is the sum of the three.
Index
Definition: Index(i): return the character at position i.
Time complexity: O(log N).

function index(node, i)
    if node.weight <= i and exists(node.right) then
        return index(node.right, i - node.weight)
    end
    if exists(node.left) then
        return index(node.left, i)
    end
    return node.string[i]
end
For example, to find the character at i=10 in Figure 2.1 shown on the
right, start at the root node (A), find that 22 is greater than 10 and
there is a left child, so go to the left child (B). 9 is less than 10, so
subtract 9 from 10 (leaving i=1) and go to the right child (D). Then because 6
is greater than 1 and there's a left child, go to the left child (G). 2 is
greater than 1 and there's a left child, so go to the left child again (J).
Finally 2 is greater than 1 but there is no left child, so the character at
index 1 of the short string "na" (i.e. "a") is the answer.
Concat
As most rope operations require balanced trees, the tree may need to be re-
balanced after concatenation.
Split
Definition: Split(i, S): split the string S into two new strings S1 and S2,
S1 = C1, ..., Ci and S2 = Ci+1, ..., Cm.
Time complexity: O(log N).
There are two cases that must be dealt with:
The split point is at the end of a string (i.e. after the last character of a leaf node).
The split point is in the middle of a string.
The second case reduces to the first by splitting the string at the split
point to create two new leaf nodes, then creating a new node that is the parent of
the two component strings.
For example, to split the 22-character rope pictured in Figure 2.3 into two
equal component ropes of length 11, query the 12th character to locate the node K
at the bottom level. Remove the link between K and G. Go to the parent of G and
subtract the weight of K from the weight of D. Travel up the tree and remove any
right links to subtrees covering characters past position 11, subtracting the
weight of K from their parent nodes (only node D and A, in this case). Finally,
build up the newly orphaned nodes K and H by concatenating them together and
creating a new parent P with weight equal to the length of the left node K.
As most rope operations require balanced trees, the tree may need to be re-
balanced after splitting.
Delete
Definition: Delete(i, j): delete the substring Ci, ..., Ci+j−1, from s to
form a new string C1, ..., Ci−1, Ci+j, ..., Cm.
Time complexity: O(log N).
This can be done with two Split() and one Concat() operation: split out the
substring to delete as its own rope, then concatenate the two remaining pieces.
Report
Definition: Report(i, j): output the string Ci, ..., Ci+j−1.
Time complexity: O(j + log N)
To report the string Ci, ..., Ci+j−1, find the node u that contains Ci and
weight(u) >= j, and then output Ci, ..., Ci+j−1 by doing an in-order
traversal of T starting at node u.
Tries
Tries are rooted trees in which each edge is labeled with a character.
In fact, a trie is a kind of DFA (Deterministic Finite Automaton). For a set of
strings, their trie is the smallest rooted tree with a character on each edge
such that each of these strings can be built by writing down the characters on
the path from the root to some node.
Its advantage: the LCP (Longest Common Prefix) of two of these strings is found
at the LCA (Lowest Common Ancestor) of their nodes in the trie (a node's string
is built by writing down the characters on the path from the root to that node).
You can add a boolean array, call it something like ends and initialize it
to false; for each word you insert, set ends to true only at the last character
of the word. Then, in your search method, change the "return true" line to
"return ends[v]". Now if you insert the word partition and look for part,
search returns false instead of true (which is what the unmodified code does).
We'd have to maintain another array of type bool end[MAX][ALF] for that.
(https://codeforces.com/blog/entry/50357)
Each row in your array represents a node in the trie structure. Node 0 is
the root.
When you want to insert a new string, begin at node 0. Let c be the first
character of your string, and see if trie[0][c] is valid or not. If it is not valid
"allocate" a new node by incrementing the size of your trie and set trie[0][c] to
the newly allocated node.
Initially:
0: -1 -1 -1 ...
1: -1 -1 -1 ...
2: -1 -1 -1 ...
3: -1 -1 -1 ...
Size of the trie is 1 (because of the root node). Now let's insert string
"ABC". Then since trie[0]['A'] is -1, we need to allocate a new node and set
trie[0]['A'] to it. Then our structure is
0: 1 -1 -1 ...
1: -1 -1 -1 ...
2: -1 -1 -1 ...
3: -1 -1 -1 ...
Now we are currently at node 1 and need to insert "BC". Following this
procedure twice, our trie looks like
0: 1 -1 -1 ...
1: -1 2 -1 ...
2: -1 -1 3 ...
3: -1 -1 -1 ...
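A small Python sketch of this array-based trie, using -1 for "no edge" exactly
as in the tables (uppercase alphabet, to match the "ABC" example):

ALPHA = 26
trie = [[-1] * ALPHA]            # row 0 is the root

def insert(word):
    node = 0
    for ch in word:
        c = ord(ch) - ord('A')
        if trie[node][c] == -1:  # "allocate" a new node
            trie.append([-1] * ALPHA)
            trie[node][c] = len(trie) - 1
        node = trie[node][c]

insert("ABC")
print(trie[0][0], trie[1][1], trie[2][2])   # 1 2 3, as in the table above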
##################################################################################
Trie Implementation
This data structure is pretty useful for storing large databases of words.
It provides search and insertion in time linear in the length of the word.
class TrieNode:
    def __init__(self):
        self.children = {}
        self.leaf = False

class Trie(object):
    def __init__(self):
        """Initialize your data structure here."""
        self.root = TrieNode()
    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.leaf = True
    def search(self, word):
        node = self.root
        for ch in word:
            if ch not in node.children:
                return False
            node = node.children[ch]
        return node.leaf
################################################################
################################################################
TRIES LEETCODE REVIEW
https://leetcode.com/discuss/interview-question/4161389/All-you-need-to-know-about-trie
class TrieNode {
public:
TrieNode* children[26]; // Assuming lowercase English alphabet
bool isEndOfWord;
TrieNode() {
for (int i = 0; i < 26; i++) {
children[i] = nullptr;
}
isEndOfWord = false;
}
};
class Trie {
public:
TrieNode* root;
Trie() {
root = new TrieNode();
}
// Methods for insertion, search, and other operations.
};
Common Problems and Solutions with Tries:
Search in a Trie:
Problem: Given a prefix, find all words in the trie that start with the
prefix.
Solution: Traverse the trie to the node representing the prefix and perform
a depth-first search to collect words under that node.
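A small sketch of that solution in Python, reusing the dict-based Trie from the
Trie Implementation section above:

def words_with_prefix(trie, prefix):
    node = trie.root
    for ch in prefix:                 # walk to the node for the prefix
        if ch not in node.children:
            return []
        node = node.children[ch]
    out = []
    def dfs(n, path):                 # collect all words under that node
        if n.leaf:
            out.append(prefix + path)
        for ch, child in n.children.items():
            dfs(child, path + ch)
    dfs(node, "")
    return out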
Word Search on a 2D Board:
Problem: Given a 2D board of characters and a list of words, find all the
words from the list in the board.
Solution: Use a trie to efficiently search for words in the board by
backtracking and avoiding unnecessary exploration.
Implementing a Spell Checker:
Problem: Create a spell checker that can suggest correct spellings for a
misspelled word.
Solution: Build a trie containing a dictionary of correctly spelled words
and use it to suggest corrections for misspelled words.
2. Compressed Trie Implementation:
#include <iostream>
#include <unordered_map>
#include <string>
class CompressedTrieNode {
public:
std::unordered_map<char, CompressedTrieNode*> children;
bool isEndOfWord;
CompressedTrieNode() : isEndOfWord(false) {}
};
class CompressedTrie {
public:
CompressedTrieNode* root;
CompressedTrie() {
root = new CompressedTrieNode();
}
void insert(const std::string& word) {
CompressedTrieNode* current = root;
for (char ch : word) {
if (current->children.find(ch) == current->children.end()) {
current->children[ch] = new CompressedTrieNode();
}
current = current->children[ch];
}
current->isEndOfWord = true;
    }
};

int main() {
CompressedTrie trie;
trie.insert("apple");
trie.insert("app");
trie.insert("banana");
return 0;
}
Compressed Trie Output:
Standard Trie (for "apple", "app", "banana"):
(root)
 ├─ a ─ p ─ p* ─ l ─ e*
 └─ b ─ a ─ n ─ a ─ n ─ a*
Compressed Trie:
In a compressed trie, nodes with a single child are merged into a single
node. This results in a more memory-efficient structure, especially when there are
long sequences of characters shared by multiple words.
(root)
 ├─ app* ─ le*
 └─ banana*
In the compressed trie, the shared sequence "app" is represented as a single
node rather than one node per character. This compression can significantly
reduce memory usage in scenarios where there is redundancy in the structure of
the tree.
Overall, while both tries represent the same set of words, the compressed
trie is more memory-efficient. However, building and searching in a compressed trie
can be a bit more complex compared to a standard trie due to the compression and
decompression processes.
Note: This part is for those who want to know more about tries:
Suffix tries and suffix trees are advanced data structures used in string
processing and pattern matching.
They are used to efficiently search for and manipulate substrings within a
larger string. Suffix trees are
closely related to suffix tries and are a more space-efficient
representation of the same information.
1. Suffix Trie:
Here's a simple example of a suffix trie for the string "banana": insert each of
its suffixes (banana, anana, nana, ana, na, a) into a trie, character by
character.
2. Suffix Tree:
Applications:
Suffix trees and suffix tries are used in various applications, including:
Longest Common Substring: They can be used to find the longest common
substring between two or more strings.
Data Compression: Suffix trees are used in data compression algorithms like
Burrows-Wheeler Transform (BWT) and Run-Length Encoding (RLE).
Text Indexing: Search engines and database systems use suffix trees for
text indexing and searching.
#include <libstree/suffix_tree.h>

int main() {
    struct suffix_tree* st = st_create();
    /* ... build and query the tree ... */
    return 0;
}
Multi-Key Trie
A multi-key trie stores multiple keys in a single trie. It's a space-efficient
data structure for storing a set of strings, and it's particularly useful for
tasks like prefix searching and string lookups.
#include <iostream>
#include <string>
#include <unordered_map>
using namespace std;

class TrieNode {
public:
    unordered_map<char, TrieNode*> children;
    bool isEndOfKey;
    TrieNode() : isEndOfKey(false) {}
};

class MultiKeyTrie {
private:
    TrieNode* root;
public:
    MultiKeyTrie() {
        root = new TrieNode();
    }
    // insert/search methods omitted in the source
};

int main() {
    MultiKeyTrie trie;
    return 0;
}
Root
├─ a ─ p ─ p ─ l ─ e*                  (end of "apple")
│          └─ e ─ t ─ i ─ z ─ e ─ r*   (end of "appetizer")
└─ b ─ a ─ n ─ a ─ n ─ a*              (end of "banana")
       └─ l ─ l*                       (end of "ball")
Radix trees, also known as radix tries or compact prefix trees, are a
space-efficient variation of the traditional trie data structure. They are designed
to store and efficiently search strings with common prefixes. Radix trees compress
the trie structure by collapsing linear chains of nodes into a single node,
reducing memory consumption and improving lookup performance. They are commonly
used in networking, file systems, and IP address storage.
#include <iostream>
#include <string>
#include <unordered_map>
using namespace std;

class RadixNode {
public:
    unordered_map<char, RadixNode*> children;  // keyed by first char of the child's label
    string label;
    bool isEndOfKey;
    RadixNode() : isEndOfKey(false) {}
};

class RadixTree {
private:
    RadixNode* root;

    // Insert 'key' (relative to 'node'), splitting a child's label if needed.
    void insert(RadixNode* node, const string& key) {
        if (key.empty()) {
            node->isEndOfKey = true;
            return;
        }
        char ch = key[0];
        if (node->children.find(ch) == node->children.end()) {
            RadixNode* leaf = new RadixNode();
            leaf->label = key;
            leaf->isEndOfKey = true;
            node->children[ch] = leaf;
            return;
        }
        RadixNode* child = node->children[ch];
        size_t n = 0;  // length of the longest common prefix
        while (n < key.size() && n < child->label.size() && key[n] == child->label[n])
            n++;
        string commonPrefix = key.substr(0, n);
        if (commonPrefix == child->label) {
            insert(child, key.substr(commonPrefix.size()));
        } else {
            // split: new node holds the common prefix, old child keeps the rest
            RadixNode* newNode = new RadixNode();
            newNode->label = commonPrefix;
            newNode->isEndOfKey = false;
            child->label = child->label.substr(commonPrefix.size());
            newNode->children[child->label[0]] = child;
            string rest = key.substr(commonPrefix.size());
            if (rest.empty()) {
                newNode->isEndOfKey = true;  // key is a proper prefix of the old label
            } else {
                RadixNode* leaf = new RadixNode();
                leaf->label = rest;
                leaf->isEndOfKey = true;
                newNode->children[rest[0]] = leaf;
            }
            node->children[ch] = newNode;
        }
    }

public:
    RadixTree() : root(new RadixNode()) {}
    void insert(const string& key) { insert(root, key); }
};

int main() {
    RadixTree tree;
    tree.insert("apple");
    tree.insert("appetizer");
    tree.insert("banana");
    tree.insert("ball");
    return 0;
}
Root
├─ app ─┬─ le*        (end of "apple")
│       └─ etizer*    (end of "appetizer")
└─ ba ──┬─ nana*      (end of "banana")
        └─ ll*        (end of "ball")
IP Address Lookup
#include <iostream>
#include <string>
#include <vector>
using namespace std;

class RadixNode {
public:
    vector<string> ips;
    RadixNode* left;
    RadixNode* right;
    RadixNode() : left(nullptr), right(nullptr) {}
};

class RadixTree {
private:
    RadixNode* root;

    // Binary trie keyed on the characters of the dotted-decimal string
    // itself ('0' goes left, anything else right) -- a simplification of
    // real bitwise IP radix lookup, kept from the source sketch.
    RadixNode* insert(RadixNode* node, const string& ip, int depth) {
        if (node == nullptr) {
            node = new RadixNode();
            node->ips.push_back(ip);
            return node;
        }
        if (node->ips.size() > 0) {
            // leaf already holds an IP: push it one level down first
            string existingIP = node->ips[0];
            node->ips.clear();
            if (existingIP[depth] == '0')
                node->left = insert(node->left, existingIP, depth + 1);
            else
                node->right = insert(node->right, existingIP, depth + 1);
        }
        if (ip[depth] == '0')
            node->left = insert(node->left, ip, depth + 1);
        else
            node->right = insert(node->right, ip, depth + 1);
        return node;
    }

public:
    RadixTree() : root(nullptr) {}

    void insert(const string& ip) { root = insert(root, ip, 0); }

    void searchPrefix(const string& ip) {
        RadixNode* node = root;
        int depth = 0;
        while (node && node->ips.empty() && depth < (int)ip.size()) {
            if (ip[depth] == '0')
                node = node->left;
            else
                node = node->right;
            depth++;
        }
        if (node && !node->ips.empty())
            cout << "Found: " << node->ips[0] << endl;
        else
            cout << "No IPs found with the prefix: " << ip << endl;
    }
};

int main() {
    RadixTree tree;
    tree.insert("192.168.0.1");
    tree.insert("192.168.1.1");
    tree.insert("192.168.0.10");
    tree.insert("192.168.1.2");
    tree.insert("10.0.0.1");
    return 0;
}
Trie vs. Hash Table
Tries and hash tables are two popular data structures used to store and
search for data. They are both efficient for searching, but they have different
characteristics that make them suitable for different use cases.
Trie: lookup is O(L) in the key length with no hashing and no collisions; it
supports prefix queries and ordered traversal of keys, but stores roughly one
node per character, which can cost more memory.
Hash Table: average O(L) lookup (hashing the key takes O(L)) with excellent
constants, but no prefix queries or ordered iteration, and collisions must be
handled.
Longest Common Prefix of a Set of Strings:
#include <iostream>
#include <string>
#include <vector>
using namespace std;

class TrieNode {
public:
    TrieNode* children[26];
    bool isEndOfWord;
    TrieNode() {
        for (int i = 0; i < 26; i++) {
            children[i] = nullptr;
        }
        isEndOfWord = false;
    }
};

class Trie {
public:
    TrieNode* root;
    Trie() {
        root = new TrieNode();
    }

    void insert(const string& word) {
        TrieNode* node = root;
        for (char ch : word) {
            int c = ch - 'a';
            if (!node->children[c]) node->children[c] = new TrieNode();
            node = node->children[c];
        }
        node->isEndOfWord = true;
    }

    // count children of a node; remembers the index of the last one seen
    int childCount(TrieNode* node, int& onlyChild) {
        int count = 0;
        for (int i = 0; i < 26; i++)
            if (node->children[i]) { count++; onlyChild = i; }
        return count;
    }

    // walk down while there is exactly one branch and no word ends here
    string longestCommonPrefix() {
        string prefix = "";
        TrieNode* current = root;
        int onlyChild = -1;
        while (current && !current->isEndOfWord && childCount(current, onlyChild) == 1) {
            current = current->children[onlyChild];
            prefix += (char)('a' + onlyChild);
        }
        return prefix;
    }
};

int main() {
    vector<string> strs = {"apple", "appetizer", "banana", "ball"};
    Trie trie;
    for (const string& s : strs) trie.insert(s);
    cout << trie.longestCommonPrefix() << endl;  // empty: no common prefix across all four
    return 0;
}
I hope you enjoyed this article and found it useful. If you have any
questions or feedback, feel free to reach out to me on Twitter. Thanks for reading!
##############################################################################

TODO IMPLEMENT IN PYTHON
COMPRESSED TRIES (PATRICIA TRIES)

Compressed Tries (Patricia Tries)
Patricia: Practical Algorithm To Retrieve Information Coded in Alphanumeric
Introduced by Morrison (1968)
Reduces storage requirement: eliminate unflagged nodes with only one child
Every path of one-child unflagged nodes is compressed to a single edge
Each node stores an index indicating the next bit to be tested during
a search (index = 0 for the first bit, index = 1 for the second bit, etc.)
A compressed trie storing n keys always has at most n − 1 internal
(non-leaf) nodes

Each node stores an index indicating the next bit to be tested during a search
(look at the CS240 notes for a visualization) AND a key, which is the completed
word for that node from the dictionary of words the trie is generated from


Search(x):
  Follow the proper path from the root down the tree to a leaf
  If the search ends in an unflagged node, it is unsuccessful
  If the search ends in a flagged node, we need to check if the key stored is
  indeed x (because we skip characters and only test selected indexes, we store
  in each node of the trie the complete word for that leaf)

Delete(x):
  Perform Search(x)
  if the search ends in an internal node, then
    if the node has two children, then unflag the node and delete the key
    else delete the node and make its only child the child of its parent
  if the search ends in a leaf, then delete the leaf and
    if its parent is unflagged, then delete the parent

Insert(x):
  Perform Search(x)
  -> If the search ends at a leaf L with key y, compare x against y.
    -> If y is a prefix of x, add a child to y containing x.
    -> Else, determine the first index i where they disagree and create a new
       node N with index i.
       Insert N along the path from the root to L so that the parent of N has
       index < i and one child of N is either L or an existing node on the path
       from the root to L that has index > i.
       The other child of N will be a new leaf node containing x.
  -> If the search ends at an internal node, we find the key corresponding to
     that internal node and proceed in a similar way to the previous case.
(A Python sketch of these operations follows below.)
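A minimal Python sketch of the above (my own illustration, not CS240's reference
code): keys are fixed-length bit strings, internal nodes test a bit index, and
every stored key lives at a leaf, so the flag bookkeeping is simplified away.

class PatNode:
    def __init__(self, index=None, key=None):
        self.index = index    # bit position to test (internal nodes)
        self.key = key        # full stored key (leaves), as the notes describe
        self.left = None      # bit == '0'
        self.right = None     # bit == '1'

class PatriciaTrie:
    def __init__(self):
        self.root = None

    def search(self, key):
        node = self.root
        while node and node.key is None:          # descend, testing selected bits
            node = node.left if key[node.index] == '0' else node.right
        # final check: bits were skipped on the way down
        return node is not None and node.key == key

    def insert(self, key):
        if self.root is None:
            self.root = PatNode(key=key)
            return
        node = self.root                          # find the leaf the search ends at
        while node.key is None:
            node = node.left if key[node.index] == '0' else node.right
        if node.key == key:
            return
        # first bit index where key and the found leaf disagree
        i = next(b for b in range(len(key)) if key[b] != node.key[b])
        new_internal = PatNode(index=i)
        new_leaf = PatNode(key=key)
        # insert new_internal along the root-to-leaf path so parents have index < i
        parent, cur = None, self.root
        while cur.key is None and cur.index < i:
            parent = cur
            cur = cur.left if key[cur.index] == '0' else cur.right
        if key[i] == '0':
            new_internal.left, new_internal.right = new_leaf, cur
        else:
            new_internal.left, new_internal.right = cur, new_leaf
        if parent is None:
            self.root = new_internal
        elif key[parent.index] == '0':
            parent.left = new_internal
        else:
            parent.right = new_internal

t = PatriciaTrie()
for k in ("0000", "0011", "0100"):
    t.insert(k)
print(t.search("0011"), t.search("0111"))   # True False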

##################################################################################
##################################################################################
##################################################################################
STRING SEARCH ALGORITHMS:
  KMP string searching
  Manacher's Algorithm
  Aho-Corasick string matching algorithm
  Z algorithm

###########################################################################
#############################################################################

Here are some typical thoughts on search types:

Boyer-Moore: works by pre-analyzing the pattern and comparing from right to left.
If a mismatch occurs, the initial analysis is used to determine how far the
pattern can be shifted w.r.t. the text being searched. This works particularly
well for long search patterns. In particular, it can be sub-linear, as you do
not need to read every single character of your text.

Knuth-Morris-Pratt: also pre-analyzes the pattern, but tries to re-use whatever
was already matched in the initial part of the pattern to avoid having to
rematch that. This can work quite well if your alphabet is small (e.g. DNA
bases), as you get a higher chance that your search patterns contain reusable
subpatterns.

Aho-Corasick: needs a lot of preprocessing, but does so for a number of patterns.
If you know you will be looking for the same search patterns over and over
again, then this is much better than the others, because you need to analyse the
patterns only once, not once per search.

Space Usage favors Rabin-Karp
One major advantage of Rabin-Karp is that it uses O(1) auxiliary storage space,
which is great if the pattern string you're looking for is very large.
For example, if you're looking for all occurrences of a string of length 10^7
in a longer string of length 10^9, not having to allocate a table of 10^7
machine words for a failure function or shift table is a major win.
Both Boyer-Moore and KMP use Ω(n) memory on a pattern string of
length n, so Rabin-Karp would be a clear win here.

Worst-Case Single-Match Efficiency Favors Boyer-Moore or KMP
Rabin-Karp suffers from two potential worst cases. First,
if the particular prime numbers used by Rabin-Karp are
known to a malicious adversary, that adversary could
potentially craft an input that causes the rolling hash to match the
hash of a pattern string at each point in time, causing the algorithm's
performance to degrade to Ω((m - n + 1)n) on a string of length m and
pattern of length n. If you're taking untrusted strings as input, this
could potentially be an issue. Neither Boyer-Moore
nor KMP have these weaknesses.

Worst-Case Multiple-Match Efficiency favors KMP.
Similarly, Rabin-Karp is slow in the case where you want to find all matches of
a pattern string when that pattern appears a large number of times.
For example, if you're looking for a string of 10^5 copies of the letter a
in a text string consisting of 10^9 copies of the letter a with Rabin-Karp,
then there will be lots of spots where the pattern string appears,
and each will require a linear scan. This can also lead
to a runtime of Ω((m - n + 1)n).

Many Boyer-Moore implementations suffer from this second rule, but will not
have bad runtimes in the first case.
And KMP has no pathological worst-cases like these.

Best-Case Performance favors Boyer-Moore
One advantage of the Boyer-Moore algorithm is that it doesn't necessarily
have to scan all the characters of the input string. Specifically,
the Bad Character Rule can be used to skip over huge regions of
the input string in the event of a mismatch. More specifically,
the best-case runtime for Boyer-Moore is O(m / n), which is much
faster than what Rabin-Karp or KMP can provide.

Generalizations to Multiple Strings favor KMP
Suppose you have a fixed set of multiple text strings that you want to
search for, rather than just one. You could, if you wanted to, run multiple
passes of Rabin-Karp, KMP, or Boyer-Moore across the strings to find all the
matches. However, the runtime of this approach isn't great, as it scales
linearly with the number of strings to search for. On the other hand, KMP
generalizes nicely to the Aho-Corasick string-matching algorithm, which runs in
time O(m + n + z), where z is the number of matches found and n is the
combined length of the pattern strings. Notice that there's no dependence
here on the number of different pattern strings being searched for!

###################################################

◮ T = AGCATGCTGCAGTCATGCTTAGGCTA
◮ P = GCT
◮ P appears three times in T
◮ A naive method takes O(mn) time
  – Initiate string comparison at every starting point
  – Each comparison takes O(m) time
◮ We can do much better!

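The naive method in Python, for reference (my sketch, not from the slides):

def naive_search(T, P):
    """Check every alignment; O(mn) worst case."""
    hits = []
    for i in range(len(T) - len(P) + 1):
        if T[i:i+len(P)] == P:
            hits.append(i)
    return hits

print(naive_search("AGCATGCTGCAGTCATGCTTAGGCTA", "GCT"))   # three hits
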
##################################################

HASH TABLE CHECKING:

◮ A function that takes a string and outputs a number
◮ A good hash function has few collisions
  – i.e., if x ≠ y, H(x) ≠ H(y) with high probability
◮ An easy and powerful hash function is a polynomial mod some prime p
  – Consider each letter as a number (ASCII value is fine)
  – H(x1 . . . xk) = x1·a^(k−1) + x2·a^(k−2) + · · · + xk−1·a + xk (mod p)
  – How do we find H(x2 . . . xk+1) from H(x1 . . . xk)?
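Answering that question: drop x1's term, multiply by a, add the new letter.
A minimal sketch, assuming the polynomial hash defined above:

def roll(H, x1, xk1, a, k, p):
    """O(1) update: H(x1..xk) -> H(x2..xk+1)."""
    return ((H - ord(x1) * pow(a, k - 1, p)) * a + ord(xk1)) % p

a, p = 256, 10**9 + 7
s, k = "abcd", 3
H = 0
for ch in s[:k]:
    H = (H * a + ord(ch)) % p         # H("abc")
H2 = roll(H, s[0], s[3], a, k, p)     # H("bcd"), computed in O(1)
check = 0
for ch in s[1:4]:
    check = (check * a + ord(ch)) % p
assert H2 == check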

Hash Table
◮ Main idea: preprocess T to speed up queries
  – Hash every substring of length k
  – k is a small constant
◮ For each query P, hash the first k letters of P to retrieve all
  the occurrences of it within T
◮ Don't forget to check collisions!

◮ Pros:
  – Easy to implement
  – Significant speedup in practice
◮ Cons:
  – Doesn't help the asymptotic efficiency
    ◮ Can still take Θ(nm) time if hashing is terrible or data is difficult
  – A lot of memory consumption

######################################
Introduction to suffix trees:

PROGRESSION:
suffix trie -> suffix array (https://web.stanford.edu/class/cs97si/10-string-algorithms.pdf)
(https://en.wikipedia.org/wiki/Suffix_tree#/media/File:Suffix_tree_BANANA.svg)

In computer science, a suffix tree (also called PAT tree or, in an earlier form,
position tree) is a compressed trie containing all the suffixes of the given
text as their keys and positions in the text as their values. Suffix trees allow
particularly fast implementations of many important string operations.

The construction of such a tree for the string S takes time and space linear in
the length of S. Once constructed, several operations can be performed quickly,
for instance locating a substring in S, locating a substring if a certain number
of mistakes are allowed, locating matches for a regular expression pattern, etc.
Suffix trees also provide one of the first linear-time solutions for the longest
common substring problem. These speedups come at a cost: storing a string's
suffix tree typically requires significantly more space than storing the string
itself.

A suffix tree can be viewed as a data structure built on top of a trie where,
instead of just adding the string itself into the trie, you would also add every
possible suffix of that string. As an example, if you wanted to index the string
banana in a suffix tree, you would build a trie with the following strings:

banana
anana
nana
ana
na
a

Once that's done you can search for any n-gram and see if it is present in your
indexed string. In other words, the n-gram search is a prefix search of all
possible suffixes of your string.

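A quick Python sketch of that idea (my own, not from the quoted answer): insert
every suffix of "banana" into a plain dict-based trie; the n-gram search is then
just a prefix walk.

def naive_suffix_trie(s):
    root = {}
    for i in range(len(s)):          # insert suffix s[i:]
        node = root
        for ch in s[i:]:
            node = node.setdefault(ch, {})
    return root

def contains(trie, q):
    node = trie
    for ch in q:
        if ch not in node:
            return False
        node = node[ch]
    return True

t = naive_suffix_trie("banana")
print(contains(t, "nan"), contains(t, "nab"))   # True False
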
This is the simplest and slowest way to build a suffix tree. It turns out that
there are many fancier variants of this data structure that improve on either or
both space and build time. I'm not well versed enough in this domain to give an
overview, but you can start by looking into suffix arrays or the classes on
advanced data structures (lectures 16 and 18).

This answer also does a wonderful job explaining a variant of this data
structure.


The suffix tree for the string S of length n is defined as a tree such that:

  The tree has exactly n leaves numbered from 1 to n.
  Except for the root, every internal node has at least two children.
  Each edge is labelled with a non-empty substring of S.
  No two edges starting out of a node can have string-labels beginning with the
  same character.
  The string obtained by concatenating all the string-labels found on the path
  from the root to leaf i spells out suffix S[i..n], for i from 1 to n.

  Since such a tree does not exist for all strings, S is padded with a terminal
  symbol not seen in the string (usually denoted $). This ensures that no suffix
  is a prefix of another, and that there will be n leaf nodes, one for each of
  the n suffixes of S. Since all internal non-root nodes are branching, there
  can be at most n − 1 such nodes, and n + (n − 1) + 1 = 2n nodes in total
  (n leaves, n − 1 internal non-root nodes, 1 root).

  Suffix links are a key feature for older linear-time construction algorithms,
  although most newer algorithms, which are based on Farach's algorithm,
  dispense with suffix links. In a complete suffix tree, all internal non-root
  nodes have a suffix link to another internal node. If the path from the root
  to a node spells the string χα, where χ is a single character and α is a
  string (possibly empty), the node has a suffix link to the internal node
  representing α. See for example the suffix link from the node for ANA to the
  node for NA in the figure referenced above. Suffix links are also used in some
  algorithms running on the tree.


A generalized suffix tree is a suffix tree made for a set of strings
instead of a single string. It represents all suffixes from this set of strings.
Each string must be terminated by a different termination symbol.

If you imagine a trie in which you put some word's suffixes, you would be able
to query it for the string's substrings very easily. This is the main idea
behind the suffix tree; it's basically a "suffix trie".

But using this naive approach, constructing this tree for a string of size n
would be O(n^2) and take a lot of memory.

Since all the entries of this tree are suffixes of the same string, they share a
lot of information, so there are optimized algorithms that allow you to create
them more efficiently. Ukkonen's algorithm, for example, allows you to create a
suffix tree online in O(n) time complexity.

######################################
Suffix tries (Part 2 progression):
A suffix tree is a compressed version of a suffix trie.
A suffix trie is just a trie that contains all the suffixes of a word. With that
trie you can:

• determine whether q is a substring of T:
  Follow the path for q starting from the root.
  If you exhaust the query string, then q is in T.

• check whether q is a suffix of T:
  Follow the path for q starting from the root.
  If you end at a leaf at the end of q, then q is a suffix of T.

• count how many times q appears in T:
  Follow the path for q starting from the root.
  The number of leaves under the node you end up in is the
  number of occurrences of q.

• find the longest repeat in T:
  Find the deepest node that has at least 2 leaves under it.

-> Find the lexicographically (alphabetically) first suffix:
  Start at the root, and follow the edge labeled with the
  lexicographically (alphabetically) smallest letter.

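A sketch of the counting query above, reusing naive_suffix_trie from the earlier
sketch (with a '$' terminator so every suffix ends at its own leaf):

def count_occurrences(trie, q):
    node = trie
    for ch in q:
        if ch not in node:
            return 0
        node = node[ch]
    def leaves(n):   # a leaf is an empty dict
        return 1 if not n else sum(leaves(c) for c in n.values())
    return leaves(node)

t = naive_suffix_trie("banana$")
print(count_occurrences(t, "ana"))   # 2
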

A node represents the prefix of some suffix:

abaaba$
(e.g. the node for "ba" is a prefix of the suffix "baaba$")

The node's suffix link should link to the
prefix of the suffix s that is 1 character shorter.
Since the suffix trie contains all
suffixes, it contains a path representing
s, and therefore contains a node
representing every prefix of s.

MAIN IDEA: every substring of s is a prefix of some suffix of s.

Explanation of Suffix links:

One important feature of suffix trees are suffix links. Suffix links are well
defined for suffix trees. If there is a node v in the tree with a label ca,
where c is a character and a is a string (non-empty), then the suffix link of v
points to s(v), which is a node with label a. If a is empty, then the suffix
link of v, i.e., s(v), is the root. Suffix links exist for every internal
(non-leaf) node of a suffix tree. This can be easily proved. (Refer to Lemma
6.1.1, Corollary 6.1.1 and Corollary 6.1.2 of Gusfield.)
Suffix links are similar to the failure functions of the Aho-Corasick
algorithm. By following these links, one can jump from one suffix to another,
each suffix starting exactly one character after the first character of its
preceding suffix. Thus using suffix links it is possible to get a linear time
algorithm for the previously stated problem: matching the longest common
substring. This is because suffix links keep track of what or how much of the
string has been matched so far. When we see a mismatch, we can jump along the
suffix link to see if there is any match later in the text and the string. Not
only that, this jumping trick also helps us to save computation time while
constructing suffix trees. Thus suffix links are very important constructs of
suffix trees. (Read more here:
http://www.cbcb.umd.edu/confcour/Spring2010/CMSC858W-materials/lecture5.pdf)

-> Find the longest common substring of T and q:

  Walk down the tree following q.
  If you hit a dead end, save the current depth,
  and follow the suffix link from the current node.
  When you exhaust q, return the longest substring found.

T = abaaba$ -> suffix tree made for this.
q = bbaa

We traverse b, reach a dead end, and follow the suffix link (ending up at the
root, because there is no proper suffix of the single character b to fall back
to). Then we traverse b, see a, see another a: the longest common substring is
"baa", of length 3.

##################################################

CONSTRUCTING SUFFIX TRIES:

Suppose we want to build the suffix trie for the string:
s = abbacabaa
We will walk down the string from left to right,
building suffix tries for s[0], s[0..1], s[0..2], ..., s[0..n].

To build the suffix trie for s[0..i], we will use the suffix trie for
s[0..i-1] built in the previous step.

To convert SufTrie(s[0..i-1]) -> SufTrie(s[0..i]), add character s[i] to all
the suffixes. For example, to go from s[0..3] = abba to s[0..4] = abbac, we
need to add nodes for the suffixes:

abbac
bbac
bac
ac
c

Some of these suffixes already exist in SufTrie(s[0..i-1]) (the ones
highlighted in the slides). Why?
How can we find these suffixes quickly?

Where is the new deepest node? (aka the longest suffix)

How do we add the suffix links for the new nodes?

To build SufTrie(s[0..i]) from SufTrie(s[0..i-1]):

CurrentSuffix = longest (aka deepest) suffix

Repeat (until you reach the root or the current node already has an edge
labeled s[i] leaving it):
  Add a child labeled s[i] to CurrentSuffix.
  Follow the suffix link to set CurrentSuffix to the next shortest suffix.

Add suffix links connecting the nodes you just added, in
the order in which you added them. (In practice, you add these links as you go
along, rather than at the end.)
Why can we stop when a node already has an s[i] edge? Because if you
already have a node for suffix αs[i], then you have a node for every
smaller suffix.

##########################################

PYTHON CODE TO BUILD SUFFIX TRIES:

class SuffixNode:
    def __init__(self, suffix_link = None):
        self.children = {}
        if suffix_link is not None:
            self.suffix_link = suffix_link
        else:
            self.suffix_link = self

    def add_link(self, c, v):
        """link this node to node v via string c"""
        self.children[c] = v

def build_suffix_trie(s):
    """Construct a suffix trie."""
    assert len(s) > 0

    # explicitly build the two-node suffix trie
    Root = SuffixNode() # the root node
    Longest = SuffixNode(suffix_link = Root)
    Root.add_link(s[0], Longest)

    # for every character left in the string
    for c in s[1:]:
        Current = Longest; Previous = None
        while c not in Current.children:
            # create new node r1 with transition Current-c->r1
            r1 = SuffixNode()
            Current.add_link(c, r1)
            # if we came from some previous node, make that
            # node's suffix link point here
            if Previous is not None:
                Previous.suffix_link = r1
            # walk down the suffix links
            Previous = r1
            Current = Current.suffix_link
        # make the last suffix link; if the c-child of Current is the node
        # we just added (possible only at the root), link to the root,
        # otherwise link to the existing one-shorter suffix node
        if Current.children[c] is Previous:
            Previous.suffix_link = Root
        else:
            Previous.suffix_link = Current.children[c]

        # move to the newly added child of the longest path
        # (which is the new longest path)
        Longest = Longest.children[c]
    return Root

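A small usage sketch (mine, not from the slides): walk character transitions
from the root to test whether q is a substring of s.

def is_substring(s, q):
    node = build_suffix_trie(s)
    for ch in q:
        if ch not in node.children:
            return False
        node = node.children[ch]
    return True

print(is_substring("abaaba", "aab"), is_substring("abaaba", "abb"))  # True False
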
TODO: KEEP TAKING NOTES ON SUFFIX TREES USING THE FOLLOWING PDF, AFTER READING THE ABOVE CODE:
(https://www.cs.cmu.edu/~ckingsf/bioinfo-lectures/suffixtrees.pdf)
Also you can follow this: (http://www.cbcb.umd.edu/confcour/Spring2010/CMSC858W-materials/lecture5.pdf)

• Suffix tries are a natural way to store a string -- search, count
occurrences, and many other queries are answerable easily.
• But they are not space efficient: O(n^2) space.
• Suffix trees are space optimal: O(n), but require a somewhat more
subtle algorithm to construct.
• Suffix trees can be constructed in O(n) time using Ukkonen's algorithm.
• Similar ideas can be used to store sets of strings.

##############################################
Some other suffix tree notes

◮ The suffix trie of a string T is a rooted tree that stores all the
  suffixes (thus all the substrings)
◮ Each node corresponds to some substring of T
◮ Each edge is associated with a letter of the alphabet
◮ For each node that corresponds to ax, there is a special
  pointer called the suffix link that leads to the node corresponding to x
◮ Surprisingly easy to implement!

◮ Given the suffix trie for T1 . . . Tn
  – We append Tn+1 = a to T, creating the necessary nodes
◮ Start at the node u corresponding to T1 . . . Tn
  – Create an a-transition to a new node v
◮ Take the suffix link at u to go to u′, corresponding to T2 . . . Tn
  – Create an a-transition to a new node v′
  – Create a suffix link from v to v′

◮ Repeat the previous process:
  – Take the suffix link at the current node
  – Make a new a-transition there
  – Create the suffix link from the previous node
◮ Stop if the node already has an a-transition
  – Because from this point, all nodes that are reachable via suffix
    links already have an a-transition

############################################################

SUFFIX ARRAY NOTES INTRO:

◮ Memory usage is O(n)
◮ Has the same computational power as a suffix trie
◮ Can be constructed in O(n) time (!)
  – But it's hard to implement
◮ There is an approachable O(n log^2 n) algorithm
  – If you want to see how it works, read the paper on the course website
  – http://cs97si.stanford.edu/suffix-array.pdf

Notes on String Problems
◮ Always be aware of the null-terminators
◮ A simple hash works well in many problems
◮ If a problem involves rotations of some string, consider
  concatenating it with itself and see if it helps
◮ The Stanford team notebook has implementations of suffix arrays
  and the KMP matcher

(https://www.cs.cmu.edu/~ckingsf/bioinfo-lectures/suffixarrays.pdf)

Suffix arrays are a more efficient way to store the suffixes that can do
most of what suffix trees can do, but just a bit slower.
• Slight space vs. time tradeoff.

s = attcatg$
Idea: lexicographically sort all the suffixes.
• Store the starting indices of the suffixes in an array.

1 attcatg$
2 ttcatg$
3 tcatg$
4 catg$
5 atg$
6 tg$
7 g$
8 $

Sort the suffixes alphabetically; the indices just
"come along for the ride":

[
8 $
5 atg$
1 attcatg$
4 catg$
7 g$
3 tcatg$
6 tg$
2 ttcatg$
]

Use case:
Does string "at" occur in s?
• Binary search to find "at" in the suffix array.
• What about "tt"?

[
8 $
5 atg$
1 attcatg$ <- at (yes it does)
4 catg$
7 g$
3 tcatg$
6 tg$
2 ttcatg$
]

How many times does "at" occur in the string?
• All the suffixes that start with "at" will be next to each other
in the array.
• Find one suffix that starts with "at" (using binary search).
• Then count the neighboring suffixes that start with "at".

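A sketch of that binary search (my own, using a naive 0-based suffix array):

def build_sa(s):
    """Naive O(n^2 log n) suffix array: sort suffix start positions."""
    return sorted(range(len(s)), key=lambda i: s[i:])

def occurs(s, sa, q):
    lo, hi = 0, len(sa)
    while lo < hi:                    # find the first suffix >= q
        mid = (lo + hi) // 2
        if s[sa[mid]:] < q:
            lo = mid + 1
        else:
            hi = mid
    return lo < len(sa) and s[sa[lo]:].startswith(q)

s = "attcatg$"
sa = build_sa(s)
print(occurs(s, sa, "at"), occurs(s, sa, "tt"))   # True True
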
K-mer counting
Problem: Given a string s and an integer k, output all pairs (b, i) such
that b is a length-k substring of s that occurs exactly i times.

k = 2:
                 CurrentCount   output
[
8 $              (skipped: length < k)
5 atg$           1
1 attcatg$       2
4 catg$          1              (at, 2)
7 g$             1              (ca, 1)
3 tcatg$         1              (g$, 1)
6 tg$            1              (tc, 1)
2 ttcatg$        1              (tg, 1)
]                               (tt, 1)

How to code:
1. Build a suffix array.
2. Walk down the suffix array, keeping a CurrentCount count:
   If the current suffix has length < k, skip it
   If the current suffix starts with the same
   length-k string as the previous suffix:
     increment CurrentCount
   else
     output CurrentCount and the previous length-k prefix
     CurrentCount := 1
   Output CurrentCount & the final length-k prefix.

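The procedure above in Python (my sketch, reusing the naive suffix array
builder from earlier):

def kmer_counts(s, k):
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    out, prev, count = [], None, 0
    for i in sa:
        if len(s) - i < k:
            continue                  # suffix shorter than k: skip
        cur = s[i:i+k]
        if cur == prev:
            count += 1
        else:
            if prev is not None:
                out.append((prev, count))
            prev, count = cur, 1
    if prev is not None:
        out.append((prev, count))
    return out

print(kmer_counts("attcatg$", 2))
# [('at', 2), ('ca', 1), ('g$', 1), ('tc', 1), ('tg', 1), ('tt', 1)]
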

Constructing Suffix Arrays
Easy O(n^2 log n) algorithm:
sort the n suffixes, which takes O(n log n) comparisons,
where each comparison takes O(n).

There are several direct O(n) algorithms for constructing suffix
arrays that use very little space.
• The Skew Algorithm is one that is based on divide-and-conquer.
• A simple O(n) algorithm: build the suffix tree, and exploit the
relationship between suffix trees and suffix arrays (see the slides).

The Skew Algorithm

Main idea: Divide suffixes into 3 groups:
• Those starting at positions i = 0, 3, 6, 9, ...  (i mod 3 = 0)
• Those starting at positions 1, 4, 7, 10, ...     (i mod 3 = 1)
• Those starting at positions 2, 5, 8, 11, ...     (i mod 3 = 2)

For simplicity, assume the text length is a multiple of 3 after padding
with a special character:
mississippi$$

Basic Outline:
• Recursively handle suffixes from the i mod 3 = 1 and i mod 3 = 2 groups.
• Merge the i mod 3 = 0 group at the end.

Handling the 1 and 2 groups
s = mississippi$$
Triples for the group 1 and group 2 suffixes:
1 -> iss iss ipp i$$
2 -> ssi ssi ppi

Assign each triple a token in lexicographical order:

    iss iss ipp i$$ ssi ssi ppi
t =  C   C   B   A   E   E   D

Every suffix of t corresponds to a suffix of s.

Recursively compute the suffix array for the tokenized string:

AEED 4
BAEED 3
CBAEED 2
CCBAEED 1
D 7
ED 6
EED 5

Key Point #1: The order of the suffixes of t is the same as the order of the
group 1 & 2 suffixes of s.

t = CCBAEED (e.g. its suffix t4 = AEED: token A is the triple i$$, so t4
corresponds to that group-1 suffix of s, with EED as extra letters)

Why?
  Every suffix of t corresponds to some suffix of s (perhaps with some extra
  letters at the end of it --- in this case EED).
  Because the tokens are sorted in the same order as the triples, the sort
  order of the suffixes of t matches that of s.

So: the recursive computation of the suffix array for t gives you the ordering
of the group 1 and group 2 suffixes.

Run Radix Sort:
O(n)-time sort for n items when items can be divided into a
constant # of digits.
• Put into buckets based on the least-significant digit, flatten, repeat
with the next-most-significant digit, etc.
• Example items: 100 123 042 333 777 892 236

# of passes = # of digits
• Each pass goes through the numbers once.

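An LSD radix sort sketch for the example items (fixed 3-digit numbers; my own
illustration):

def radix_sort(nums, digits=3):
    for d in range(digits):               # least-significant digit first
        buckets = [[] for _ in range(10)]
        for x in nums:
            buckets[(x // 10**d) % 10].append(x)
        nums = [x for b in buckets for x in b]   # flatten, stably
    return nums

print(radix_sort([100, 123, 42, 333, 777, 892, 236]))
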

Handling 0 Suffixes
• First: sort the group 0 suffixes, using the representation (s[i], S_{i+1})
• Since the S_{i+1} suffixes are already sorted in the array, we can just
stably sort them with respect to s[i], again using radix sort.

1,2-array: ipp iss iss i$$ ppi ssi ssi
0-array: mis pi$ sip sis

• We have to merge the group 0 suffixes into the suffix array for groups 1 and 2.

Given suffixes S_i and S_j, we need to decide which should come first.
• If S_i and S_j are both either group 1 or group 2, then the recursively
computed suffix array gives the order.
• If one of i or j is 0 (mod 3), see below.

Comparing a 0 suffix S_j with a 1 or 2 suffix S_i

Represent S_i and S_j using subsequent suffixes:

i (mod 3) = 1:  compare (s[i], S_{i+1}) < (s[j], S_{j+1})
                (i+1 ≡ 2 (mod 3) and j+1 ≡ 1 (mod 3), both in the 1,2-array)

i (mod 3) = 2:  compare (s[i], s[i+1], S_{i+2}) < (s[j], s[j+1], S_{j+2})
                (i+2 ≡ 1 (mod 3) and j+2 ≡ 2 (mod 3), both in the 1,2-array)

-> The suffixes can be compared quickly because the relative order
of S_{i+1}, S_{j+1} or S_{i+2}, S_{j+2} is known from the 1,2-array we already
computed.

Running Time

T(n) = O(n) + T(2n/3)
(the O(n) term is the time to sort and merge; the array in the recursive call
is 2/3rds the size of the starting array)

Solves to T(n) = O(n):
• Expand the big-O notation: T(n) ≤ cn + T(2n/3) for some c.
• Guess: T(n) ≤ 3cn
• Induction step: assume that it is true for all i < n.
• T(n) ≤ cn + 3c(2n/3) = cn + 2cn = 3cn ☐


#################################################################################
STRING ALGORITHM: Suffix Tries and Suffix Trees

What if we want to search for many patterns P within the same fixed text T?

Idea: Preprocess the text T rather than the pattern P.
Observation: P is a substring of T if and only if P is a prefix of some suffix
of T.
We will call a trie that stores all suffixes of a text T a suffix trie, and the
compressed suffix trie of T a suffix tree.

Build the suffix trie, i.e. the trie containing all the suffixes of the text.
Build the suffix tree by compressing the trie above (like in Patricia trees).
Store two indexes l, r on each node v (both internal nodes and
leaves) where node v corresponds to substring T[l..r].

T = bananaban
SUFFIXES: {bananaban$, ananaban$, nanaban$, anaban$, naban$, aban$, ban$, an$, n$}

Use $ to indicate the end of suffixes in the trie.
CREATE TRIE!
Then create the compressed trie, aka suffix tree (and each node contains the
two indexes l, r).

Suffix Trees: Pattern Matching
To search for a pattern P of length m:
  Similar to search in a compressed trie, with the difference that we are
  looking for a prefix match rather than a complete match
  If we reach a leaf with a corresponding string length less than m, then
  the search is unsuccessful
  Otherwise, we reach a node v (leaf or internal) with a corresponding
  string length of at least m
  It suffices to check the first m characters against the substring of
  the text between the indices of the node, to see if there indeed is a match
  We can then visit all children of the node to report all matches

################################################################################
SUFFIX ARRAYS:
In computer science, the longest common prefix array (LCP array) is an
auxiliary
data structure to the suffix array. It stores the lengths of the longest common
prefixes (LCPs) between all pairs of consecutive suffixes in a sorted suffix
array.
For example, if A := [aab, ab, abaab, b, baab] is a suffix array, the longest
common prefix between A[1] = aab and A[2] = ab is a which has length 1, so H[2]
= 1
in the LCP array H. Likewise, the LCP of A[2] = ab and A[3] = abaab is ab, so
H[3] = 2.
Augmenting the suffix array with the LCP array allows one to efficiently
simulate top-down and bottom-up traversals of the suffix tree,[1][2]
speeds up pattern matching on the suffix array[3] and is a
prerequisite for compressed suffix trees.
Suffix array
A suffix array is a data structure that helps you sort
all the suffixes in lexicographic order.
Code :
#include <algorithm>
#include <cstring>
using namespace std;

namespace HashSuffixArray
{
    const int MAXN = 1 << 21;
    typedef unsigned long long hash;
    const hash BASE = 1000003ULL;  // assumed base; not given in the source

    int N;
    char * S;
    int sa[MAXN];
    hash h[MAXN], hPow[MAXN];

    // hash of the length-'len' substring starting at position i
    inline hash getHash(int i, int len) { return h[i] - h[i + len] * hPow[len]; }

    // order suffixes i and j: binary-search their longest common prefix
    // using hashes, then compare the first differing character
    inline bool sufCmp(int i, int j)
    {
        int lo = 0, hi = min(N - i, N - j);
        while (lo < hi)
        {
            int mid = (lo + hi) / 2;
            if (getHash(i, mid + 1) == getHash(j, mid + 1)) lo = mid + 1;
            else hi = mid;
        }
        return i + lo == N || (j + lo < N && S[i + lo] < S[j + lo]);
    }

    void buildSA()
    {
        N = strlen(S);
        hPow[0] = 1;
        for (int i = 1; i <= N; ++i)
            hPow[i] = hPow[i - 1] * BASE;
        h[N] = 0;
        for (int i = N - 1; i >= 0; --i)
            h[i] = h[i + 1] * BASE + S[i], sa[i] = i;
        stable_sort(sa, sa + N, sufCmp);
    }
} // end namespace HashSuffixArray
Code :
/*
Suffix array O(n lg^2 n)
LCP table O(n)
*/
#include <cstdio>
#include <algorithm>
#include <cstring>
using namespace std;

#define REP(i, n) for (int i = 0; i < (int)(n); ++i)

namespace SuffixArray
{
    const int MAXN = 1 << 21;
    char * S;
    int N, gap;
    int sa[MAXN], pos[MAXN], tmp[MAXN], lcp[MAXN];

    // compare suffixes by their first 'gap' characters (via pos), then the rest
    bool sufCmp(int i, int j)
    {
        if (pos[i] != pos[j])
            return pos[i] < pos[j];
        i += gap;
        j += gap;
        return (i < N && j < N) ? pos[i] < pos[j] : i > j;
    }

    void buildSA()
    {
        N = strlen(S);
        REP(i, N) sa[i] = i, pos[i] = S[i];
        for (gap = 1;; gap *= 2)
        {
            sort(sa, sa + N, sufCmp);
            REP(i, N - 1) tmp[i + 1] = tmp[i] + sufCmp(sa[i], sa[i + 1]);
            REP(i, N) pos[sa[i]] = tmp[i];
            if (tmp[N - 1] == N - 1) break;
        }
    }

    void buildLCP()
    {
        for (int i = 0, k = 0; i < N; ++i) if (pos[i] != N - 1)
        {
            for (int j = sa[pos[i] + 1]; S[i + k] == S[j + k];)
                ++k;
            lcp[pos[i]] = k;
            if (k)--k;
        }
    }
} // end namespace SuffixArray
(Codes by mukel)
#######################################################
FAST KMP EXPLANATION: (https://web.stanford.edu/class/cs97si/10-string-algorithms.pdf)

i    1 2 3 4 5 6 7 8 9 10
Pi   a b a b a b a b c a
π[i] 0 0 1 2 3 4 5 6 0 1

Classic walkthrough (T = "ABC ABCDAB ABCDABCDABDE", P = "ABCDABD"):

     12345678901
T =  ABC ABCDAB ABCDABCDABDE
P =  ABCDABD

Mismatch at T4: 3 letters are matched, so shift P by 3 − π[3] = 3 letters.
Mismatch at T4 again: currently no letters are matched, shift P by 0 − π[0] = 1 letter.
Mismatch at T11: 6 letters are matched, shift P by 6 − π[6] = 4 letters.
Continuing this way, P is eventually fully matched at T16..T22. Matched!!

Computing π
◮ Observation 1: if P1 . . . Pπ[i] is a suffix of P1 . . . Pi, then
  P1 . . . Pπ[i]−1 is a suffix of P1 . . . Pi−1
  – Well, obviously...
◮ A non-obvious conclusion:
  – First, let's write π(k)[i] as π[·] applied k times to i
  – e.g., π(2)[i] = π[π[i]]
  – π[i] is equal to π(k)[i − 1] + 1, where k is the smallest integer that
    satisfies Pπ(k)[i−1]+1 = Pi
  -> If there is no such k, π[i] = 0
Implementation (P and T are 1-indexed here; m = |P|, n = |T|):
pi[0] = -1;
int k = -1;
for(int i = 1; i <= m; i++) {
    while(k >= 0 && P[k+1] != P[i])
        k = pi[k];
    pi[i] = ++k;
}

Pattern Matching Impl:
int k = 0;
for(int i = 1; i <= n; i++) {
    while(k >= 0 && P[k+1] != T[i])
        k = pi[k];
    k++;
    if(k == m) {
        // P matches T[i-m+1..i]
        k = pi[k];
    }
}
#################################################################################
STRING ALGORITHM: RABIN KARP
Rabin Karp
The Rabin-Karp algorithm needs to calculate hash values for the following strings:
1) The pattern itself.
2) All the substrings of the text of length m.
Since we need to efficiently calculate hash values for all the substrings
of size m of the text, we must have a hash function with the following
property: the hash at the next shift must be efficiently computable from the
current hash value and the next character in the text, i.e.,
hash(txt[s+1 .. s+m]) = rehash(txt[s+m], hash(txt[s .. s+m-1])),
and rehash must be an O(1) operation.

def rabin_karp(pat, txt, d=256, q=101):
    M, N = len(pat), len(txt)
    h = pow(d, M - 1, q)          # d^(M-1) mod q, used to drop the leading char
    p = t = 0
    for i in range(M):            # initial hashes of pattern and first window
        p = (d * p + ord(pat[i])) % q
        t = (d * t + ord(txt[i])) % q
    for s in range(N - M + 1):
        # if p == t and pat[0...M-1] == txt[s ... s+M-1]
        if p == t and txt[s:s + M] == pat:
            print("Pattern found at index " + str(s))
        if s < N - M:             # roll the hash: drop txt[s], add txt[s+M]
            t = ((t - ord(txt[s]) * h) * d + ord(txt[s + M])) % q

if __name__ == '__main__':
    rabin_karp("GCT", "AGCATGCTGCAGTCATGCTTAGGCTA")
#################################################################################
STRING ALGORITHM: KMP
Examples:
Pattern: a a a a a
LSP : 0 1 2 3 4
Pattern: a b a b a b
LSP : 0 0 1 2 3 4
Pattern: a b a c a b a b
LSP : 0 0 1 0 1 2 3 2
Pattern: a a a b a a a a a b
LSP : 0 1 2 0 1 2 3 3 3 4
txt[] = "AAAAABAAABA"
pat[] = "AAAA"
lps[] = {0, 1, 2, 3}
i = 0, j = 0
txt[] = "AAAAABAAABA"
pat[] = "AAAA"
txt[i] and pat[j] match, do i++, j++
i = 1, j = 1
txt[] = "AAAAABAAABA"
pat[] = "AAAA"
txt[i] and pat[j] match, do i++, j++
i = 2, j = 2
txt[] = "AAAAABAAABA"
pat[] = "AAAA"
txt[i] and pat[j] match, do i++, j++
i = 3, j = 3
txt[] = "AAAAABAAABA"
pat[] = "AAAA"
txt[i] and pat[j] match, do i++, j++
i = 4, j = 4
Since j == M, print "pattern found at index 0" and reset j,
j = lps[j-1] = lps[3] = 3
txt[i] and pat[j] match, do i++, j++
i = 5, j = 4
Since j == M, print "pattern found at index 1" and reset j,
j = lps[j-1] = lps[3] = 3
i = 5, j = 3
txt[] = "AAAAABAAABA"
pat[] = "AAAA"
txt[i] and pat[j] do NOT match and j > 0, change only j
j = lps[j-1] = lps[2] = 2
i = 5, j = 2
txt[] = "AAAAABAAABA"
pat[] = "AAAA"
txt[i] and pat[j] do NOT match and j > 0, change only j
j = lps[j-1] = lps[1] = 1
i = 5, j = 1
txt[] = "AAAAABAAABA"
pat[] = "AAAA"
txt[i] and pat[j] do NOT match and j > 0, change only j
j = lps[j-1] = lps[0] = 0
i = 5, j = 0
txt[] = "AAAAABAAABA"
pat[] = "AAAA"
txt[i] and pat[j] do NOT match and j is 0, we do i++.
i = 6, j = 0
txt[] = "AAAAABAAABA"
pat[] = "AAAA"
txt[i] and pat[j] match, do i++ and j++
i = 7, j = 1
txt[] = "AAAAABAAABA"
pat[] = "AAAA"
txt[i] and pat[j] match, do i++ and j++
        if j == M:
            print("Found pattern at index " + str(i-j))
            j = lps[j-1]

txt = "ABABDABACDABABCABAB"
pat = "ABABCABAB"
KMPSearch(pat, txt)
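A complete runnable version of that search (the snippet above only shows the
match-reporting step); standard KMP, 0-indexed:

def compute_lps(pat):
    lps = [0] * len(pat)
    length, i = 0, 1
    while i < len(pat):
        if pat[i] == pat[length]:
            length += 1
            lps[i] = length
            i += 1
        elif length != 0:
            length = lps[length - 1]
        else:
            lps[i] = 0
            i += 1
    return lps

def KMPSearch(pat, txt):
    lps = compute_lps(pat)
    i = j = 0
    while i < len(txt):
        if txt[i] == pat[j]:
            i += 1
            j += 1
            if j == len(pat):
                print("Found pattern at index " + str(i - j))
                j = lps[j - 1]
        elif j > 0:
            j = lps[j - 1]
        else:
            i += 1

KMPSearch("ABABCABAB", "ABABDABACDABABCABAB")   # Found pattern at index 10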
#################################################################################
STRING ALGORITHM: BOYER MOORE WALKTHROUGH:
PART A:
Based on three key ideas:
Reverse-order searching: Compare P with a subsequence of T moving
backwards
PART B:
The insight behind Boyer-Moore is that if you start searching for a
pattern in a string starting with the last character in the pattern,
you can jump your search forward multiple characters when you hit a mismatch.
E.g.:
s = WHICH FINALLY HALTS.  AT THAT POINT...
p = AT THAT
i =       ^    (compare right-to-left, starting from the end of p)
The B-M paper makes the following observations:
(1) if we try matching a character that is not in p then we can jump forward n
characters (n = |p|); e.g. 'F' is not in p, so we can advance n = 7 characters.
(2) if we try matching a character whose last position in p is k from the end,
then we can jump forward k characters:
' 's last position in p is 4 from the end, hence we advance 4 characters.
(3) the same reasoning applies mid-pattern: 'L' is not in p and the mismatch
occurred against p6, hence we can advance (at least) 6 characters.
PART C:
Now that said, the algorithm is based on a simple principle. Suppose that I'm
trying to match a substring of length m. I'm going to first look at the
character at index m. If that character is not in my string, I know that the
substring I want can't start at indices 1, 2, ..., m.
Once I start matching from the beginning of the substring, when I find a
mismatch, I can't just start from scratch: I could be partially through a match
starting at a different point. For instance, if I'm trying to match anand in
ananand, I successfully match anan, then realize that the following character
is an a, not a d. But I've just matched an, and so I should jump back to trying
to match the third character of my substring. This "If I fail after matching x
characters, I could be on the y'th character of a match" information is stored
in the second table.
Note that when I fail to match, the second table knows how far along in a match
I might be based on what I just matched. The first table knows how far back I
might be based on the character that I just saw which I failed to match. You
want to use the more pessimistic of those two pieces of information.
#################################################################################
STRING ALGORITHM: BOYER Moore BAD CHARACTER HEURISTIC
Boyer-Moore Algorithm
NO_OF_CHARS = 256

def badCharHeuristic(pat, size):
    """Last-occurrence index of every character in the pattern (-1 if absent)."""
    badChar = [-1] * NO_OF_CHARS
    for i in range(size):
        badChar[ord(pat[i])] = i
    return badChar

def search(txt, pat):
    m, n = len(pat), len(txt)
    badChar = badCharHeuristic(pat, m)
    s = 0  # shift of the pattern with respect to the text
    while s <= n - m:
        j = m - 1
        while j >= 0 and pat[j] == txt[s + j]:
            j -= 1
        if j < 0:
            print("Pattern occurs at shift = " + str(s))
            '''
            Shift the pattern so that the next character in text
            aligns with the last occurrence of it in pattern.
            The condition s+m < n is necessary for the case when
            pattern occurs at the end of text
            '''
            s += (m - badChar[ord(txt[s+m])] if s+m < n else 1)
        else:
            '''
            Shift the pattern so that the bad character in text
            aligns with the last occurrence of it in pattern. The
            max function is used to make sure that we get a positive
            shift. We may get a negative shift if the last occurrence
            of bad character in pattern is on the right side of the
            current character.
            '''
            s += max(1, j - badChar[ord(txt[s+j])])

if __name__ == '__main__':
    search("ABAAABCD", "ABC")
#################################################################################
STRING ALGORITHM: BOYER MOORE GOOD SUFFIX HEURISTIC
https://www.geeksforgeeks.org/boyer-moore-algorithm-good-suffix-heuristic/
#################################################################################
STRING ALGORITHM: BOYER Moore COMBINED
#########################################################################
########################################################################
Formally a trie is a rooted tree, where each edge of the tree is labeled by
some letter. All outgoing edges from one vertex must have different labels.
Consider any path in the trie from the root to any vertex. If we write out
the labels of all edges on the path, we get a string that corresponds to
this path. For any vertex in the trie we will associate the string
from the root to the vertex.
Each vertex will also have a flag leaf which will be true, if any string
from the given set corresponds to this vertex.
Accordingly, to build a trie for a set of strings means to build a trie such
that each leaf vertex will correspond to one string from the set, and
conversely each string of the set corresponds to one leaf vertex.

const int K = 26;  // alphabet size

struct Vertex {
    int next[K];
    bool leaf = false;

    Vertex() {
        fill(begin(next), end(next), -1);
    }
};

vector<Vertex> trie(1);
_______________________
To add strings to trie:
Now we implement a function that will add a string s to the trie.
The implementation is extremely simple: we start at the root node,
and as long as there are edges corresponding to the characters of
s we follow them. If there is no edge for one character, we simply
generate a new vertex and connect it via an edge. At the end of the
process we mark the last vertex with flag leaf.
_______________________
void add_string(string const& s) {
int v = 0;
for (char ch : s) {
int c = ch - 'a';
if (trie[v].next[c] == -1) {
trie[v].next[c] = trie.size();
trie.emplace_back(); // Default construction of the vector!
}
v = trie[v].next[c];
}
trie[v].leaf = true;
}
_______________________
The implementation obviously runs in linear time. And
since every vertex stores k links, it will use O(mk) memory.

Construction of an automaton
Suppose we have built a trie for the given set of strings. Now let's
look at it from a different side. If we look at any vertex, the string that
corresponds to it is a prefix of one or more strings in the set; thus
each vertex of the trie can be interpreted as a position in
one or more strings from the set.
For example let the trie be constructed by the strings ab and bc,
and we are currently at the vertex corresponding to ab, which
is a leaf. For a transition with the letter c, we are forced to go
to the state corresponding to the string b, and from
there follow the edge with the letter c.
struct Vertex {
    int next[K];
    bool leaf = false;
    int p = -1;        // parent vertex
    char pch;          // character of the edge from p to this vertex
    int link = -1;     // suffix link, computed lazily
    int go[K];         // automaton transitions, computed lazily

    Vertex(int p = -1, char ch = '$') : p(p), pch(ch) {
        fill(begin(next), end(next), -1);
        fill(begin(go), end(go), -1);
    }
};

vector<Vertex> t(1);

int go(int v, char ch);  // forward declaration, defined below

int get_link(int v) {
    if (t[v].link == -1) {
        if (v == 0 || t[v].p == 0)
            t[v].link = 0;
        else
            t[v].link = go(get_link(t[v].p), t[v].pch);
    }
    return t[v].link;
}

int go(int v, char ch) {
    int c = ch - 'a';
    if (t[v].go[c] == -1) {
        if (t[v].next[c] != -1)
            t[v].go[c] = t[v].next[c];
        else
            t[v].go[c] = v == 0 ? 0 : go(get_link(v), ch);
    }
    return t[v].go[c];
}
Recall from last time that we needed to construct the trie, and then
set its failure transitions. After the trie is constructed, we traverse the
trie as we are reading in the input text and output the positions that
we find the keywords at. Essentially, these three parts form the structure of
the algorithm.
Note that this trie only handles lowercase words, for simplicity in my
testing.
To add a keyword into the trie, we traverse the longest prefix of
the keyword that exists in the trie starting from the root,
then we add the characters of the rest of the keyword as nodes in the trie, in
a chain.
def add_keyword(keyword):
    """ add a keyword to the trie and mark output at the last node """
    current_state = 0
    j = 0
    keyword = keyword.lower()
    child = find_next_state(current_state, keyword[j])
    while child is not None:
        current_state = child
        j = j + 1
        if j < len(keyword):
            child = find_next_state(current_state, keyword[j])
        else:
            break
    for i in range(j, len(keyword)):
        node = {'value': keyword[i], 'next_states': [], 'fail_state': 0, 'output': []}
        AdjList.append(node)
        AdjList[current_state]["next_states"].append(len(AdjList) - 1)
        current_state = len(AdjList) - 1
    AdjList[current_state]["output"].append(keyword)
The while loop finds the longest prefix of the keyword which exists in the
trie so far, and will exit when we can no longer match more characters
at index j. The for loop goes through the rest of the keyword, creating
a new node for each character and appending it to AdjList. len(AdjList) - 1
gives the id of the node we are appending, since we are adding to the end of
AdjList.
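The snippets in this section rely on a global AdjList and a find_next_state helper that this excerpt never shows; here is a minimal sketch of both, reconstructed to match the node layout used above (the exact names in the original tutorial may differ).

# global adjacency list of trie nodes; node 0 is the root
AdjList = [{'value': '', 'next_states': [], 'fail_state': 0, 'output': []}]

def find_next_state(current_state, value):
    # return the child of current_state labelled with value, or None
    for node in AdjList[current_state]["next_states"]:
        if AdjList[node]["value"] == value:
            return node
    return None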
Consider a node r, and suppose we are setting the failure state for a node
child of r. The initial candidate, state, is the node for the next longest
proper suffix, which is marked by r's fail state.
If there is no transition from state to a node with the same value
as child, then we move on to the next longest proper suffix,
which is the fail state of state, and so on, until we find one
that works, or we reach the root.
Child's fail state is then the node reached from this state by
child's character.
from collections import deque

def set_fail_transitions():
    q = deque()
    for node in AdjList[0]["next_states"]:
        q.append(node)
        AdjList[node]["fail_state"] = 0
    while q:
        r = q.popleft()
        for child in AdjList[r]["next_states"]:
            q.append(child)
            state = AdjList[r]["fail_state"]  # parent's fail state
            while find_next_state(state, AdjList[child]["value"]) is None and state != 0:
                state = AdjList[state]["fail_state"]
            AdjList[child]["fail_state"] = find_next_state(state, AdjList[child]["value"])
            if AdjList[child]["fail_state"] is None:
                AdjList[child]["fail_state"] = 0
            AdjList[child]["output"] = AdjList[child]["output"] + AdjList[AdjList[child]["fail_state"]]["output"]
Finally, our trie is constructed. Given an input line, we iterate through each
character in line, going up the fail states when we no longer match the next
character in line. At each node we check whether there is any output, and
we capture all of the output words together with their respective indices
(i - len(j) + 1 is the index of the first character of the matched word j).
def get_keywords_found(line):
    """ returns all keywords in trie found in line, with their start indices """
    line = line.lower()
    current_state = 0
    keywords_found = []

    for i in range(len(line)):
        while find_next_state(current_state, line[i]) is None and current_state != 0:
            current_state = AdjList[current_state]["fail_state"]
        current_state = find_next_state(current_state, line[i])
        if current_state is None:
            current_state = 0
        else:
            for j in AdjList[current_state]["output"]:
                keywords_found.append({"index": i - len(j) + 1, "word": j})
    return keywords_found
Yay! We are done!
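To sanity-check the three functions, a small hypothetical driver (the keywords and the text are mine, not from the tutorial):

for word in ["cash", "shew", "ew"]:
    add_keyword(word)
set_fail_transitions()
print(get_keywords_found("cashew"))
# expected something like:
# [{'index': 0, 'word': 'cash'}, {'index': 2, 'word': 'shew'}, {'index': 4, 'word': 'ew'}]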
#########################################################################
#########################################################################
Aho-Corasick APPLICATIONS:
Applications
Find all strings from a given set in a text. We construct an automaton for
this set of strings. We will now process the text letter by letter,
transitioning between the different states. Initially we are at the root of
the trie. If we are at any time at state v, and the next letter is c, then we
transition to the next state with go(v, c), thereby either increasing the
length of the current matched substring by 1, or decreasing it by
following a suffix link.
How can we find out, for a state v, whether there are any matches with
strings from the set? First, it is clear that if we stand on a leaf vertex,
then the string corresponding to the vertex ends at this position in the text.
However this is by no means the only possible case of achieving a match:
if we can reach one or more leaf vertices by moving along the suffix links,
then there will also be a match corresponding to each found leaf vertex.
A simple example demonstrating this situation can be created using the
set of strings {dabce, abc, bc} and the text dabc.
Thus if we store in each leaf vertex the index of the string corresponding
to it (or the list of indices if duplicate strings appear in the set),
then we can find in O(n) time the indices of all strings which match
the current state, by simply following the suffix links from the current
vertex to the root. However, this is not the most efficient solution,
since it gives us O(n · len) complexity in total. It can
be optimized by computing and storing, for each vertex, the nearest leaf
vertex that is reachable using suffix links (this is sometimes called the
exit link). This value can be computed lazily in linear time. Thus for
each vertex we can advance in O(1) time to the next marked
vertex in the suffix link path, i.e. to the next match; for each
match we therefore spend O(1) time, and we reach the complexity O(len + ans).
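A minimal Python sketch of the lazily computed exit link, assuming the automaton stores its leaf flags and suffix links in plain arrays leaf[] and link[], with the root's link equal to -1 (these names and this layout are my assumptions, not notation from the excerpt above):

def exit_link(v, leaf, link, memo):
    """Nearest marked (leaf) vertex strictly above v in the suffix-link
    chain, or -1 if there is none. memo[v] == -2 means not computed yet."""
    if memo[v] == -2:
        u = link[v]
        if u == -1:          # v is the root: nothing above it
            memo[v] = -1
        elif leaf[u]:
            memo[v] = u
        else:
            memo[v] = exit_link(u, leaf, link, memo)
    return memo[v]

def matches_at(v, leaf, link, memo):
    """All marked vertices on the suffix-link path of v, O(1) per match."""
    result = [v] if leaf[v] else []
    u = exit_link(v, leaf, link, memo)
    while u != -1:
        result.append(u)
        u = exit_link(u, leaf, link, memo)
    return result

# usage: memo = [-2] * number_of_vertices; matches_at(v, leaf, link, memo)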
If you only want to count the occurrences and not find the indices
themselves, you can calculate the number of marked vertices in the suffix
link path for each vertex v. This can be calculated in O(n) time in total.
Thus we can sum up all matches in O(len).
A second application: given a set of strings and a length L, find the
lexicographically smallest string of length L that does not contain any
string from the set. We can construct the automaton for the set of strings.
Let's remember that the vertices from which we can reach a leaf vertex are
the states at which we have a match with a string from the set. Since in
this task we have to avoid matches, we are not allowed to enter such states,
while all other vertices may be entered. Thus we delete all "bad" vertices
from the automaton, and in the remaining graph we find the
lexicographically smallest path of length L. This task can be solved
in O(L), for example by depth first search.
Problems
UVA #11590 - Prefix Lookup
UVA #11171 - SMS
UVA #10679 - I Love Strings!!
Codeforces - x-prime Substrings
Codeforces - Frequency of String
CodeChef - TWOSTRS
################################################################
################################################################
Suppose we are given a string s of length n. The Z-function for this string
is an array of length n where the i-th element is equal to the greatest
number of characters starting from position i that coincide with the first
characters of s. In other words, z[i] is the length of the longest common
prefix of s and the suffix of s starting at position i.
Examples
For example, here are the values of the Z-function computed for different
strings:
"aaaaa" - [0,4,3,2,1]
"aaabaab" - [0,2,1,0,2,1,0]
"abacaba" - [0,0,1,0,3,0,1]
Trivial algorithm
The formal definition can be represented in the following elementary O(n^2)
implementation.
vector<int> z_function_trivial(string s) {
int n = (int) s.length();
vector<int> z(n);
for (int i = 1; i < n; ++i)
while (i + z[i] < n && s[z[i]] == s[i + z[i]])
++z[i];
return z;
}
We just iterate through every position i and update z[i] for each one of them,
starting from z[i] = 0 and incrementing it as long as we don't find
a mismatch (and as long as we don't reach the end of the string).
For the sake of brevity, let's call segment matches those substrings that
coincide with a prefix of s. For example, the value of the desired
Z-function z[i] is the length of the segment match starting at
position i (and that ends at position i+z[i]−1).
To compute the Z-function efficiently, we will keep the [l,r] indices of the rightmost segment match.
That is, among all detected segments we will keep the one that ends
rightmost. In a way, the index r can be seen as the "boundary" to
which our string s has been scanned by the algorithm;
everything beyond that point is not yet known.
Then, if the current index (for which we have to compute the next
value of the Z-function) is i, we have one of two options:
i>r -- the current position is outside of what we have already processed.
We will then compute z[i] with the trivial algorithm (that is, just comparing
values one by one). Note that in the end, if z[i]>0, we'll have to
update the indices of the rightmost segment, because it's
guaranteed that the new r=i+z[i]−1 is better than the previous r.
i≤r -- the current position is inside the current segment match [l,r].
Then we can use the already calculated Z-values to "initialize" the value
of z[i] to something (it sure is better than "starting from zero"),
maybe even some big number.
For this, we observe that the substrings s[l…r] and s[0…r−l] match.
This means that as an initial approximation for z[i] we can take
the value already computed for the corresponding segment s[0…r−l],
and that is z[i−l].
s="aaaabaa"
When we get to the last position (i=6), the current match
segment will be [5,6]. Position 6 will then match position 6−5=1,
for which the value of the Z-function is z[1]=3. Obviously, we
cannot initialize z[6] to 3, it would be completely incorrect.
The maximum value we could initialize it to is 1 -- because it's the
largest value that doesn't bring us beyond the index r of the match segment
[l,r].
This gives the initialization rule: z0[i] = min(r−i+1, z[i−l]).
After having z[i] initialized to z0[i], we try to increment z[i] by
running the trivial algorithm -- because in general, after the border r,
we cannot know if the segment will continue to match or not.
Thus, the whole algorithm is split in two cases, which differ only in
the initial value of z[i]: in the first case it's assumed to be zero,
in the second case it is determined by the previously computed
values (using the above formula). After that, both branches of this
algorithm can be reduced to the implementation of the trivial algorithm,
which starts immediately after we specify the initial value.
The algorithm turns out to be very simple. Despite the fact that on
each iteration the trivial algorithm is run, we have made significant
progress, having an algorithm that runs in linear time. Later on we will
prove that the running time is linear.
Implementation
Implementation turns out to be rather laconic:
vector<int> z_function(string s) {
int n = (int) s.length();
vector<int> z(n);
for (int i = 1, l = 0, r = 0; i < n; ++i) {
if (i <= r)
z[i] = min (r - i + 1, z[i - l]);
while (i + z[i] < n && s[z[i]] == s[i + z[i]])
++z[i];
if (i + z[i] - 1 > r)
l = i, r = i + z[i] - 1;
}
return z;
}
Array z is initially filled with zeros. The current rightmost match segment
is assumed to be [0;0] (that is, a deliberately small segment which doesn't
contain any i).
Inside the loop for i=1…n−1 we first determine the initial value z[i] --
it will either remain zero or be computed using the above formula.
In the end, if it's required (that is, if i+z[i]−1>r), we update the rightmost
match segment [l,r].
###################################################################
#####################################################################
z Algorithm Applications:
(https://cp-algorithms.com/string/z-function.html)
Applications
We will now consider some uses of Z-functions for specific tasks.
Search the substring: given a text t and a pattern p, find all occurrences
of p inside t. To solve this problem, we create a new string s=p+⋄+t, that
is, we apply string concatenation to p and t but we also put a separator
character ⋄ in the middle (we'll choose ⋄ so that it will certainly not be
present anywhere in the strings p or t).
Compute the Z-function for s. Then, for any i in the interval [0;length(t)−1],
we will consider the corresponding value k=z[i+length(p)+1]. If k is equal
to length(p) then we know there is one occurrence of p in the i-th position
of t, otherwise there is no occurrence of p in the i-th position of t.
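A short self-contained Python sketch of this application, with a Python port of the Z-function implementation shown above (the separator '#' stands in for ⋄ and is assumed to occur in neither string):

def z_function(s):
    # linear-time Z-function, as in the C++ implementation above
    n = len(s)
    z = [0] * n
    l = r = 0
    for i in range(1, n):
        if i <= r:
            z[i] = min(r - i + 1, z[i - l])
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] - 1 > r:
            l, r = i, i + z[i] - 1
    return z

def find_occurrences(t, p):
    # positions i in t where the pattern p occurs
    s = p + '#' + t
    z = z_function(s)
    return [i for i in range(len(t) - len(p) + 1) if z[i + len(p) + 1] == len(p)]

print(find_occurrences("abacaba", "aba"))  # expected [0, 4]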
Number of distinct substrings: given a string s, count its distinct
substrings. We solve this iteratively: knowing the answer for some prefix,
we compute how it changes when a character c is appended.
Take the string t=s+c and invert it (write its characters in reverse order).
Our task is now to count how many prefixes of t are not found anywhere else
in t; each such prefix is a new substring ending at the appended character.
Let's compute the Z-function of t and find its maximum value zmax. Obviously,
t's prefix of length zmax occurs also somewhere in the middle of t. Clearly,
shorter prefixes also occur.
So, we have found that the number of new substrings that appear when symbol c
is
appended to s is equal to length(t)−zmax.
Consequently, the running time of this solution, which recomputes the
Z-function after every appended character, is O(n^2) for a string of length
n.
It's worth noting that in exactly the same way we can recalculate, still in
O(n) time, the number of distinct substrings when appending a character in the
beginning of the string, as well as when removing it (from the end or the
beginning).
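A sketch of the counting recipe, reusing the z_function sketch above (the overall loop is O(n^2), matching the bound stated):

def count_distinct_substrings(s):
    total = 0
    t = ''
    for c in s:
        t = c + t                  # t is the reverse of the prefix processed so far
        z = z_function(t)
        total += len(t) - max(z)   # z[0] == 0, so max(z) is the z_max above
    return total

print(count_distinct_substrings("aaba"))  # expected 8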
String compression
Given a string s of length n. Find its shortest "compressed" representation,
that is: find a string t of shortest length such that s can be represented
as a concatenation of one or more copies of t.
A solution is: compute the Z-function of s, loop through all i such that i
divides n. Stop at the first i such that i+z[i]=n. Then, the string s
can be compressed to the length i.
The proof for this fact is the same as the solution which uses the prefix
function.
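And a sketch of the compression check, again reusing the z_function sketch above (the function name is mine):

def shortest_period(s):
    # smallest t such that s = t + t + ... + t
    n = len(s)
    z = z_function(s)
    for i in range(1, n):
        if n % i == 0 and i + z[i] == n:
            return s[:i]
    return s

print(shortest_period("abababab"))  # expected "ab"
print(shortest_period("abcab"))     # expected "abcab"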
#########################################################################
#########################################################################
Z algorithm (https://www.hackerearth.com/practice/algorithms/string-algorithm/z-algorithm/tutorial/)
By convention we take z[0] = 0 in the examples below (some implementations
instead set z[0] = n, as the Java code at the end of this section does).
Examples
For example, here are the values of the Z-function computed for different
strings:
s = 'aaaaa'
Z[0] Z[1] Z[2] Z[3] Z[4]
0 4 3 2 1
s = 'aaabaab'
Z[0] Z[1] Z[2] Z[3] Z[4] Z[5] Z[6]
0 2 1 0 2 1 0
s = 'abacaba'
Z[0] Z[1] Z[2] Z[3] Z[4] Z[5] Z[6]
0 0 1 0 3 0 1
Trivial algorithm
vector<int> z_function_trivial(string s)
{
int n = (int) s.length();
vector<int> z(n);
for (int i = 1; i < n; ++i)
while (i + z[i] < n && s[z[i]] == s[i + z[i]])
++z[i];
return z;
}
We just iterate through every position i and update z[i] for each one of
them, starting from z[i] = 0 and incrementing it as long as we do not
find a mismatch (and as long as we do not reach the end of the string).
Efficient algorithm
The idea is to maintain an interval [L, R], which is the interval with
maximum R such that [L,R] is a prefix substring (a substring which is also a
prefix).
1) If i > R, then there is no prefix substring that starts before i and
ends after i, so we reset [L, R] and compute the new [L, R] by
comparing str[i..] to str[0..], and get Z[i] (= R-L+1).
2) If i <= R, then let K = i-L; now Z[i] >= min(Z[K], R-i+1), because
str[i..] matches str[K..] for at least R-i+1 characters (they are in the
[L,R] interval, which we know is a prefix substring).
Now two sub-cases arise --
a) If Z[K] < R-i+1, then there is no longer prefix substring starting at
str[i] (otherwise Z[K] would be larger), so Z[i] = Z[K] and the
interval [L,R] remains the same.
b) If Z[K] >= R-i+1, then it is possible to extend the [L,R] interval:
we set L as i, start matching from str[R] onwards to get the new R,
then update the interval [L,R] and calculate Z[i] (= R-L+1).
(Note that this prose treats R as inclusive, while the Java implementation
below treats R as exclusive, hence its min(R - i, z[i - L]).)
Implementation
// returns the array z[] where z[i] is the Z-function value of s at position i
int[] zFunction(String s) {
int n = s.length();
int z[] = new int[n];
int R = 0;
int L = 0;
for(int i = 1; i < n; i++) {
z[i] = 0;
if (R > i) {
z[i] = Math.min(R - i, z[i - L]);
}
while (i + z[i] < n && s.charAt(i+z[i]) == s.charAt(z[i])) {
z[i]++;
}
if (i + z[i] > R) {
L = i;
R = i + z[i];
}
}
z[0] = n;
return z;
}
Complexity
Worst case time complexity: Θ(N)
Average case time complexity: Θ(N)
Best case time complexity: Θ(N)
Space complexity: Θ(N) for the output array (only O(1) auxiliary space is used)
Applications
PLEASE COVER:
Boyer moore good character heuristic/bad char heuristic
Aho-Corasick Algorithm for Pattern Searching
Suffix Tree/Suffix Array
Manacher's algorithm
(https://www.hackerearth.com/practice/algorithms/string-algorithm/manachars-algorithm/tutorial/)
##############################################################################
Tries
A trie is a kind of rooted tree in which each edge has a character on it.
A trie is in fact a kind of DFA (Deterministic Finite Automaton). For a bunch
of strings, their trie is the smallest rooted tree with a character on each
edge such that each of these strings can be built by writing down the
characters on the path from the root to some node.
Its advantage is that the LCP (Longest Common Prefix) of two of these strings
is the string of the LCA (Lowest Common Ancestor) of their nodes in the trie
(a node's string is the one we can build by writing down the characters on
the path from the root to that node).
Let f[k] be the list of links for the k-th node, indexed by character
(0..CHARSET-1, not by raw ASCII code), and let f[k][x] = m be the node that
is the child of the k-th node via the x-th character; m = -1 if there is no
such link.
#include <cstring>  // memset

const int MAX = 1000000;  // maximum number of trie nodes
const int CHARSET = 26;   // lowercase English letters
const int ROOT = 0;

int f[MAX][CHARSET];
int sz = 1;  // number of nodes currently in the trie; node 0 is the root

void init() {
    memset(f, -1, sizeof f);
}

void insert(const char s[]) {
    int node = ROOT;
    for (int i = 0; s[i] != '\0'; i++) {
        int c = s[i] - 'a';
        if (f[node][c] == -1)
            f[node][c] = sz++;
        node = f[node][c];
    }
}
Notes: the root is node 0 (row f[0]); sz is the number of nodes currently in the trie.