String Algorithm
String processing has many practical applications: handling encoded information in information
processing, analyzing genetic codes written with four characters (A, C, T, and G) in computational
biology, exchanging data in communication systems, and describing sets of strings, a topic that
belongs to the theory of formal languages, in programming systems.
1. String Sorts
This section discusses how to solve the sorting problem when the keys are strings. Unlike the
general-purpose sorting algorithms we already know, string sorts use different methods to achieve
better performance. Before running any of them, however, we need some insight into the strings
themselves: how random they are, how long their common prefixes are, which encoding they use, and
so on. Without enough insight into the strings we are dealing with, a sorting algorithm may not
perform as well as expected.
Key-Indexed Counting
Before we discuss any string algorithms, we need to understand how to index a set of
string-integer pairs. An example of string-integer pairs is illustrated on the left.
Intuitively, the idea of key-indexed counting is to sort the strings by their integer keys.
That is why we can use this method as a building block for the more complex string sorts
discussed later. Below is the implementation of key-indexed counting in Java:
int N = a.length;
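A minimal sketch of the four steps of key-indexed counting (count the frequencies, turn the counts into starting indices, distribute the items, copy back) might look as follows. The class name KeyIndexedCountingSketch, the parallel arrays names and keys, and the parameter R (keys are assumed to lie in the range 0 to R-1) are assumptions made for illustration.

```java
class KeyIndexedCountingSketch {
    // Stably sorts names by their integer keys; every key must lie in [0, R).
    static void sort(String[] names, int[] keys, int R) {
        int N = names.length;
        String[] auxNames = new String[N];
        int[] auxKeys = new int[N];
        int[] count = new int[R + 1];

        // Step 1: count the frequency of each key, offset by one position.
        for (int i = 0; i < N; i++)
            count[keys[i] + 1]++;
        // Step 2: transform the counts into starting indices.
        for (int r = 0; r < R; r++)
            count[r + 1] += count[r];
        // Step 3: distribute the items to their sorted positions.
        for (int i = 0; i < N; i++) {
            int pos = count[keys[i]]++;
            auxNames[pos] = names[i];
            auxKeys[pos] = keys[i];
        }
        // Step 4: copy the sorted items back.
        for (int i = 0; i < N; i++) {
            names[i] = auxNames[i];
            keys[i] = auxKeys[i];
        }
    }
}
```

Because items with equal keys keep their relative order, the method is stable, which is the property the string sorts rely on.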
LSD string sort uses ~7WN + 3WR array accesses and extra space proportional to N + R, where N is
the number of strings, W is their (fixed) length, and R is the radix.
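As a rough illustration of where the W passes come from, LSD string sort can be sketched as one key-indexed counting pass per character position, moving from right to left. The class name LsdSketch is hypothetical, and the sketch assumes that all strings have exactly W characters drawn from extended ASCII (R = 256).

```java
class LsdSketch {
    // Sorts strings that all have exactly W extended-ASCII characters.
    static void sort(String[] a, int W) {
        int N = a.length;
        int R = 256;
        String[] aux = new String[N];
        // One stable key-indexed counting pass per position, rightmost first.
        for (int d = W - 1; d >= 0; d--) {
            int[] count = new int[R + 1];
            for (int i = 0; i < N; i++)      // count frequencies
                count[a[i].charAt(d) + 1]++;
            for (int r = 0; r < R; r++)      // counts to indices
                count[r + 1] += count[r];
            for (int i = 0; i < N; i++)      // distribute
                aux[count[a[i].charAt(d)]++] = a[i];
            for (int i = 0; i < N; i++)      // copy back
                a[i] = aux[i];
        }
    }
}
```

The stability of each pass is what makes the result correct: ordering by position d never disturbs the order already established by the positions to its right.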
We maintain a cutoff value to identify small subarrays, so that MSD can switch to insertion sort
for them and improve its running time. Another aspect to watch is the radix in MSD: it is fine
for extended ASCII, where R = 256, but for Unicode, where R = 65536, it is not good, because
every recursive call must initialize a count array with R entries.
MSD string sort uses between 8N + 3R and ~7wN + 3wR array accesses, where w is the average
string length.
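A minimal sketch of MSD string sort with a cutoff to insertion sort, under some assumptions, is shown below. The class name MsdSketch and the cutoff value 15 are assumptions chosen for illustration, and the radix is fixed at R = 256. The helper charAt returns -1 past the end of a string, so that shorter strings sort before longer strings with the same prefix.

```java
class MsdSketch {
    private static final int R = 256;      // radix: extended ASCII
    private static final int CUTOFF = 15;  // assumed cutoff for small subarrays

    // Character at position d, or -1 if the string has ended.
    private static int charAt(String s, int d) {
        return d < s.length() ? s.charAt(d) : -1;
    }

    static void sort(String[] a) {
        sort(a, new String[a.length], 0, a.length - 1, 0);
    }

    // Sorts a[lo..hi], whose strings all share the same first d characters.
    private static void sort(String[] a, String[] aux, int lo, int hi, int d) {
        if (hi <= lo + CUTOFF) {
            insertion(a, lo, hi, d);
            return;
        }
        int[] count = new int[R + 2];
        for (int i = lo; i <= hi; i++)     // count frequencies
            count[charAt(a[i], d) + 2]++;
        for (int r = 0; r < R + 1; r++)    // counts to indices
            count[r + 1] += count[r];
        for (int i = lo; i <= hi; i++)     // distribute
            aux[count[charAt(a[i], d) + 1]++] = a[i];
        for (int i = lo; i <= hi; i++)     // copy back
            a[i] = aux[i - lo];
        // Recursively sort the subarray of each character value.
        for (int r = 0; r < R; r++)
            sort(a, aux, lo + count[r], lo + count[r + 1] - 1, d + 1);
    }

    // Insertion sort on a[lo..hi], comparing from position d onward.
    private static void insertion(String[] a, int lo, int hi, int d) {
        for (int i = lo; i <= hi; i++)
            for (int j = i; j > lo && less(a[j], a[j - 1], d); j--) {
                String t = a[j];
                a[j] = a[j - 1];
                a[j - 1] = t;
            }
    }

    private static boolean less(String v, String w, int d) {
        return v.substring(d).compareTo(w.substring(d)) < 0;
    }
}
```

Each call that does not hit the cutoff allocates and scans a count array of R + 2 entries, which is why a large radix makes the many small subarrays expensive.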
2. Tries
In this section, we deal with the string search problem. Classic structures such as the BST and
the red-black BST set the baseline performance for searching a set of integers or characters.
For strings, we look for improvement by exploiting patterns in the keys: some strings may share
a long common prefix, or the lengths may be uniformly distributed. A trie is aware of these
patterns and therefore offers a data structure suited to speeding up string search. It resembles
a compression method, but here it relies heavily on graph theory.
To avoid confusion with the term ‘tree’: the term trie comes from the word retrieval and was
introduced by E. Fredkin in 1960. It is a bit of wordplay, much like the term ‘dynamic
programming’, which refers to a mathematical concept rather than to computer programming as such.
Image 4 illustrates the anatomy of a trie that corresponds to a small set of words. To reproduce
the trie construction shown in image 4, we need to follow these rules:
1. The root is a node with a null value.
2. All leaves are nodes that have no children, that is, all of their links are null.
3. A trie corresponds to a map of key-value pairs; in this case, the key is a string and the
value is an integer that identifies the string.
4. A string may be a prefix or suffix of another string, while a complete string is a path from
the root to a node that holds a value. For example, the word ‘she’ is a prefix of the word
‘shells’; they are terminated by ids 0 and 3, respectively.
R-Way Trie
The idea of the R-way trie is that every node carries R links, one per possible character value,
so that it can handle arbitrary input strings. For example, if we need to index ASCII-based words
that are uniformly distributed, then we need to accommodate at least LR possible combinations,
where L is the average length of a string. Image 5 illustrates the construction of an R-way trie
over the 256 extended-ASCII characters for the words sea, shells, and she.
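To make the node layout concrete, here is a minimal sketch of an R-way trie that maps strings to integer ids. The class name RWayTrieSketch and the method names put and get are assumptions for illustration. Each node holds an optional value and an array of R = 256 links, so a key is never stored explicitly: it is spelled out by the path from the root.

```java
class RWayTrieSketch {
    private static final int R = 256;  // radix: extended ASCII

    private static class Node {
        Integer val;                   // null unless a key ends at this node
        Node[] next = new Node[R];     // one link per possible character
    }

    private Node root;

    // Associates val with key, creating nodes along the path as needed.
    void put(String key, int val) {
        root = put(root, key, val, 0);
    }

    private Node put(Node x, String key, int val, int d) {
        if (x == null) x = new Node();
        if (d == key.length()) {
            x.val = val;
            return x;
        }
        char c = key.charAt(d);
        x.next[c] = put(x.next[c], key, val, d + 1);
        return x;
    }

    // Returns the value stored for key, or null if key is not in the trie.
    Integer get(String key) {
        Node x = root;
        for (int d = 0; x != null && d < key.length(); d++)
            x = x.next[key.charAt(d)];
        return x == null ? null : x.val;
    }
}
```

For example, after put("she", 0) and put("shells", 3), the call get("she") returns 0 even though the node for ‘she’ is not a leaf, while get("shell") returns null because no value is stored on that node. The price of this simplicity is space: every node allocates R links, most of which stay null.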
3. Substring Search
Let’s make a leap: since the brute-force implementation is not a serious candidate because of its
worst-case running time (~NM), our attention here goes to more efficient algorithms.
    /**
     * Will print:
     * % java KMP AACAA AABRAACADABRAACAADABRA
     * text: AABRAACADABRAACAADABRA
     * pattern:             AACAA
     */
    public static void main(String[] args)
    {
        String pat = args[0];
        String txt = args[1];
        KMP kmp = new KMP(pat);
        StdOut.println("text: " + txt);
        int offset = kmp.search(txt);
        StdOut.print("pattern: ");
        for (int i = 0; i < offset; i++)
            StdOut.print(" ");
        StdOut.println(pat);
    }
}
The worst-case processing time of KMP is proportional to N + M: M for preprocessing the pattern
and N for searching the text.
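A minimal sketch of the DFA-based variant of KMP is given below to show why the text pointer never backs up. The class name KmpSketch is an assumption, the pattern is assumed to be non-empty, and the alphabet is assumed to be extended ASCII (R = 256). Note that this variant builds a table with R times M entries, so its preprocessing cost also depends on R.

```java
class KmpSketch {
    private final String pat;
    private final int[][] dfa;  // dfa[c][j]: next state on character c in state j

    // Builds the DFA for a non-empty pattern over extended ASCII.
    KmpSketch(String pat) {
        this.pat = pat;
        int M = pat.length();
        int R = 256;
        dfa = new int[R][M];
        dfa[pat.charAt(0)][0] = 1;
        for (int X = 0, j = 1; j < M; j++) {
            for (int c = 0; c < R; c++)
                dfa[c][j] = dfa[c][X];        // mismatch: behave like restart state X
            dfa[pat.charAt(j)][j] = j + 1;    // match: advance to the next state
            X = dfa[pat.charAt(j)][X];        // update the restart state
        }
    }

    // Returns the index of the first match, or txt.length() if there is none.
    int search(String txt) {
        int N = txt.length();
        int M = pat.length();
        int i, j;
        for (i = 0, j = 0; i < N && j < M; i++)
            j = dfa[txt.charAt(i)][j];        // one table lookup per text character
        return j == M ? i - M : N;
    }
}
```

With the pattern AACAA and the text AABRAACADABRAACAADABRA, search returns 12, the offset used to indent the pattern in the output of the main method above.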
class BoyerMoore
{
    private int[] right;
    private String pat;

    BoyerMoore(String pat)
    { // Compute skip table
        this.pat = pat;
        int M = pat.length();
        int R = 256;
        right = new int[R];
        for (int c=0; c<R; c++)
            right[c] = -1;
        for (int j=0; j<M; j++)
            right[pat.charAt(j)] = j;
    }

    /**
     * Will print:
     * % java BoyerMoore AACAA AABRAACADABRAACAADABRA
     * text: AABRAACADABRAACAADABRA
     * pattern:             AACAA
     */
    public static void main(String[] args)
    {
        String pat = args[0];
        String txt = args[1];
        BoyerMoore boyerMoore = new BoyerMoore(pat);
        StdOut.println("text: " + txt);
        int offset = boyerMoore.search(txt);
        StdOut.print("pattern: ");
        for (int i = 0; i < offset; i++)
            StdOut.print(" ");
        StdOut.println(pat);
    }
}
A typical implementation of the Boyer-Moore algorithm, like the code above, has a worst-case
running time proportional to NM. A full implementation of Boyer-Moore provides a linear-time
worst-case guarantee by adding a KMP-like table. Looking at Image 10, we can choose a better
skip value by using a KMP-like array rather than simply decrementing the index j.
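The search step of the mismatched-character heuristic can be sketched as follows. This is an illustration under assumptions, not a reproduction of the listing above: the class name BoyerMooreSketch and the static method search are hypothetical, and the alphabet is assumed to be extended ASCII. The pattern is compared from right to left, and on a mismatch the rightmost occurrence of the offending text character in the pattern determines the skip.

```java
class BoyerMooreSketch {
    // Returns the index of the first match, or txt.length() if there is none.
    static int search(String pat, String txt) {
        int R = 256;
        int M = pat.length();
        int N = txt.length();

        // right[c] = rightmost position of c in the pattern, or -1 if absent.
        int[] right = new int[R];
        for (int c = 0; c < R; c++)
            right[c] = -1;
        for (int j = 0; j < M; j++)
            right[pat.charAt(j)] = j;

        int skip;
        for (int i = 0; i <= N - M; i += skip) {
            skip = 0;
            for (int j = M - 1; j >= 0; j--) {     // scan the pattern right to left
                if (pat.charAt(j) != txt.charAt(i + j)) {
                    // Align the mismatched text character with its rightmost
                    // occurrence in the pattern; always move at least one step.
                    skip = Math.max(1, j - right[txt.charAt(i + j)]);
                    break;
                }
            }
            if (skip == 0) return i;               // no mismatch: match found
        }
        return N;                                  // no match found
    }
}
```

When the mismatched text character does not occur in the pattern at all, right is -1 and the pattern jumps past it entirely, which is where the typical N/M behavior comes from.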
To avoid recomputing each hash from scratch, we can use the formula above to derive the value of
the next substring: subtract the leftmost term, multiply what remains by R (which raises the
order of every remaining term), and add the new last character as a constant:
x_{i+1} = (x_i - t_i R^{M-1}) R + t_{i+M}
To avoid expensive arithmetic and to save space (an int only holds values below 2^31), we keep
only the remainder of each hash value. This method is called modular hashing:
H(x_i) = x_i mod Q
where Q is a prime number. The calculation is similar to the one used for hash tables, but here
we do not store the hash values in a table; we only care about the hash value of each substring.
Choosing the right value of Q is not a trivial problem, but with a large enough Q the probability
of a collision is about 1/Q. Relying on this guarantee alone is called Monte Carlo correctness.
If we take a defensive approach instead and must ensure that a reported match is always correct,
we use Las Vegas correctness, which requires backing up in the text to verify each candidate
match character by character. A graphical illustration of the Rabin-Karp algorithm is shown in
the image below.
    // Monte Carlo
    // public boolean check(int i)
    // {
    //     return true;
    // }
    // Las Vegas
    public boolean check(String txt, int i)
    {
        for (int j = 0; j < M; j++)
            if (pat.charAt(j) != txt.charAt(i+j))
                return false;
        return true;
    }
            int offset = i - M + 1;
            if (patHash == txtHash && check(txt, offset))
                return offset; // Match
        }
        return N; // No match found
    }
}
Rabin-Karp substring search is known as fingerprint search because it uses a small amount of
information to represent a pattern. Then it looks for this fingerprint (the hash value) in the text. The
algorithm is efficient because the fingerprints can be efficiently computed and compared.
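A self-contained sketch of the fingerprint idea, in its Las Vegas form, could look like this. The class name RabinKarpSketch, the fixed prime Q = 1000000007, and the radix R = 256 are assumptions made for illustration; the names patHash, txtHash, and check follow the fragment above. The hash of each window is derived from the previous one in constant time by removing the leading character and appending the trailing one, all modulo Q.

```java
class RabinKarpSketch {
    private static final long Q = 1_000_000_007L;  // assumed prime modulus
    private static final int R = 256;              // radix: extended ASCII

    private final String pat;
    private final int M;         // pattern length
    private final long RM;       // R^(M-1) mod Q, used to remove the leading character
    private final long patHash;  // fingerprint of the pattern

    RabinKarpSketch(String pat) {
        this.pat = pat;
        this.M = pat.length();
        long rm = 1;
        for (int i = 1; i <= M - 1; i++)
            rm = (R * rm) % Q;
        this.RM = rm;
        this.patHash = hash(pat, M);
    }

    // Horner's method: hash of the first len characters of key, modulo Q.
    private static long hash(String key, int len) {
        long h = 0;
        for (int j = 0; j < len; j++)
            h = (R * h + key.charAt(j)) % Q;
        return h;
    }

    // Las Vegas check: confirm a candidate match character by character.
    private boolean check(String txt, int i) {
        for (int j = 0; j < M; j++)
            if (pat.charAt(j) != txt.charAt(i + j))
                return false;
        return true;
    }

    // Returns the index of the first match, or txt.length() if there is none.
    int search(String txt) {
        int N = txt.length();
        if (N < M) return N;
        long txtHash = hash(txt, M);
        if (patHash == txtHash && check(txt, 0)) return 0;
        for (int i = M; i < N; i++) {
            // Remove the leading character, then append the trailing one.
            txtHash = (txtHash + Q - RM * txt.charAt(i - M) % Q) % Q;
            txtHash = (txtHash * R + txt.charAt(i)) % Q;
            int offset = i - M + 1;
            if (patHash == txtHash && check(txt, offset))
                return offset;   // match
        }
        return N;                // no match found
    }
}
```

Dropping the call to check turns this into the Monte Carlo version, which never backs up in the text but may, with probability of about 1/Q per comparison, report a false match.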
Summary
Algorithm     Version                    Guarantee  Typical  Backup?  Correct?  Extra space
Brute force   -                          MN         1.1 N    yes      yes       1
KMP           Full DFA                   2N         1.1 N    no       yes       MR
KMP           Mismatch transition only   3N         1.1 N    no       yes       M
Boyer-Moore   Full algorithm             3N         N/M      yes      yes       R
Boyer-Moore   Mismatch char heuristic    MN         N/M      yes      yes       R
Rabin-Karp    Monte Carlo                7N         7N       no       yes*      1
Rabin-Karp    Las Vegas                  7N*        7N       yes      yes       1