Week 9: String Algorithms, Approximation

Strings
A string is a sequence of characters.
Examples of strings:
• C program
• HTML document
• DNA sequence
• Digitised image
Examples of alphabets:
• ASCII
• Unicode
• {0,1}
• {A,C,G,T}
Notation:
• length(P) … number of characters in P
• λ … empty string (length(λ) = 0)
• Σ^m … set of all strings of length m over alphabet Σ
• Σ* … set of all strings over alphabet Σ
Pattern Matching

Given a text T of length n and a pattern P of length m, find the starting index of a substring of T equal to P.
Example (pattern checked backwards):
• Text … abacaab
• Pattern … abacab
Applications:
• Text editors
• Search engines
• Biological research
NaiveMatching(T,P):
| Input  text T of length n, pattern P of length m
| Output starting index of a substring of T equal to P
|        -1 if no such substring exists
|
| for all i=0..n-m do
| |  j=0                              // check from left to right
| |  while j<m and T[i+j]=P[j] do     // test ith shift of pattern
| |  |  j=j+1
| |  |  if j=m then
| |  |  |  return i                   // entire pattern checked
| |  |  end if
| |  end while
| end for
| return -1                           // no match found
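As a concrete reference, a direct C rendering of the pseudocode (a minimal sketch; the function name and the use of C strings with strlen are my own choices):

#include <string.h>

// Return the starting index of the first substring of t equal to p,
// or -1 if no such substring exists.
int naiveMatch(const char *t, const char *p) {
    int n = strlen(t), m = strlen(p);
    for (int i = 0; i <= n - m; i++) {    // try every shift of the pattern
        int j = 0;
        while (j < m && t[i + j] == p[j]) // test the ith shift, left to right
            j++;
        if (j == m)                       // entire pattern matched
            return i;
    }
    return -1;
}

For example, naiveMatch("abacaab", "acaa") returns 2.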
Analysis of Naive Pattern Matching
Naive pattern matching runs in O(n·m) time; worst-case example:
• T = aaa…ah
• P = aaah
• may occur in DNA sequences
• unlikely in English text
When a mismatch occurs between P[j] and T[i+j], shift the pattern all the way to align P[0] with T[i+j]
Example (P = abcde):

  Text:    abcdabcdeabcc        abcdabcdeabcc
  Pattern: abcdexxxxxxxx   ⇒    xxxxabcde
Boyer-Moore Algorithm
The Boyer-Moore pattern matching algorithm is based on two heuristics (the pattern is compared starting from its back, and after every shift the comparison restarts from the back):
• Looking-glass heuristic: compare P with a subsequence of T moving backwards
• Character-jump heuristic: when a mismatch occurs at T[i]=c
  – if P contains c ⇒ shift P so as to align the last occurrence of c in P with T[i]
  – otherwise ⇒ shift P past the mismatch entirely, aligning P[0] with T[i+1]
• last-occurrence function L
  – L maps Σ to integers such that L(c) is defined as
    · the largest index i such that P[i]=c, or
    · -1 if no such index exists
Example:

  c    | a | b | c | d
  L(c) | 2 | 3 | 1 | -1

• L can be represented by an array indexed by the numeric codes of the characters
• L can be computed in O(m+s) time (m … length of pattern, s … size of Σ)
BoyerMooreMatch(T,P,Σ):
| Input  text T of length n, pattern P of length m, alphabet Σ
| Output starting index of a substring of T equal to P
|        -1 if no such substring exists
|
| L=lastOccurrenceFunction(P,Σ)
| i=m-1, j=m-1                       // start at end of pattern
| repeat
| |  if T[i]=P[j] then
| |  |  if j=0 then
| |  |  |  return i                  // match found at i
| |  |  else
| |  |  |  i=i-1, j=j-1              // keep comparing backwards
| |  |  end if
| |  else                            // character-jump
| |  |  i=i+m-min(j,1+L[T[i]])
| |  |  j=m-1
| |  end if
| until i≥n
| return -1                          // no match
Note: the biggest jump (m characters ahead, i=i+m) occurs when L[T[i]] = -1
Example: last-occurrence function for P = abacab:

  c    | a | b | c | d
  L(c) | 4 | 5 | 3 | -1

In the worked example (figure omitted), Boyer-Moore needs 13 comparisons in total.
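The complete algorithm in C, including the computation of the last-occurrence function (a minimal sketch assuming an ASCII alphabet; names are my own):

#include <string.h>

#define ASCII 128                        // assumed alphabet size

// Return the starting index of a substring of t equal to p, or -1.
int boyerMooreMatch(const char *t, const char *p) {
    int n = strlen(t), m = strlen(p);
    if (m == 0) return 0;                // empty pattern matches at 0

    // last-occurrence function: L[c] = largest i with p[i]==c, else -1
    int L[ASCII];
    for (int c = 0; c < ASCII; c++) L[c] = -1;
    for (int i = 0; i < m; i++) L[(unsigned char)p[i]] = i;

    int i = m - 1, j = m - 1;            // start at end of pattern
    while (i < n) {
        if (t[i] == p[j]) {
            if (j == 0) return i;        // match found at i
            i--; j--;                    // keep comparing backwards
        } else {                         // character-jump heuristic
            int last = L[(unsigned char)t[i]];
            int jump = (j < 1 + last) ? j : 1 + last;  // min(j, 1+L[t[i]])
            i += m - jump;
            j = m - 1;                   // restart from the back of the pattern
        }
    }
    return -1;                           // no match
}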
Knuth-Morris-Pratt Algorithm
The Knuth-Morris-Pratt algorithm compares the pattern with the text from the FRONT (left to right) and, on a mismatch, continues with the current i rather than backing up in the text.
Reminder:
• When a mismatch occurs, what is the most we can shift the pattern to avoid redundant comparisons?
• Answer: the largest prefix of P[0..j] that is a suffix of P[1..j]
KMP preprocesses the pattern P[0..m-1] to find matches of its prefixes with the pattern itself:
• the failure function F(j) is the size of the largest prefix of P[0..j] that is also a suffix of P[1..j]
Example: P = abaaba

  j    | 0 | 1 | 2 | 3 | 4 | 5
  P[j] | a | b | a | a | b | a
  F(j) | 0 | 0 | 1 | 1 | 2 | 3
KMPMatch(T,P):
| Input  text T of length n, pattern P of length m
| Output starting index of a substring of T equal to P
|        -1 if no such substring exists
|
| F=failureFunction(P)
| i=0, j=0                           // start from left
| while i<n do
| |  if T[i]=P[j] then
| |  |  if j=m-1 then
| |  |  |  return i-j                // match found at i-j
| |  |  else
| |  |  |  i=i+1, j=j+1              // keep comparing
| |  |  end if
| |  else if j>0 then                // mismatch and j>0?
| |  |  j=F[j-1]                     // → shift pattern to i-F[j-1]
| |  else                            // mismatch and j still 0?
| |  |  i=i+1                        // → begin at next text character
| |  end if
| end while
| return -1                          // no match
Example: failure function for P = abacab:

  j    | 0 | 1 | 2 | 3 | 4 | 5
  P[j] | a | b | a | c | a | b
  F(j) | 0 | 0 | 1 | 0 | 1 | 2

In the worked example (figure omitted), KMP needs 19 comparisons in total.
failureFunction(P):
| Input  pattern P of length m
| Output failure function for P
|
| F[0]=0                             // F[0] is always 0
| j=1, len=0
| while j<m do
| |  if P[j]=P[len] then
| |  |  len=len+1                    // we have matched len+1 characters
| |  |  F[j]=len                     // P[0..len-1] = P[j-len+1..j]
| |  |  j=j+1
| |  else if len>0 then              // mismatch and len>0?
| |  |  len=F[len-1]                 // → reuse already computed F[len-1] as new len
| |  else                            // mismatch and len still 0?
| |  |  F[j]=0                       // → no prefix of P[0..j] is a suffix of P[1..j]
| |  |  j=j+1                        // → continue with next pattern character
| |  end if
| end while
| return F
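Both functions translate directly into C (a minimal sketch; the heap-allocated failure array is my own choice):

#include <stdlib.h>
#include <string.h>

// Compute the failure function of p[0..m-1] into caller-allocated F.
static void failureFunction(const char *p, int m, int *F) {
    F[0] = 0;                            // F[0] is always 0
    int len = 0;
    for (int j = 1; j < m; ) {
        if (p[j] == p[len]) {
            len++;                       // matched len+1 characters
            F[j++] = len;
        } else if (len > 0) {
            len = F[len - 1];            // reuse already computed value
        } else {
            F[j++] = 0;                  // no prefix of p[0..j] is a suffix of p[1..j]
        }
    }
}

// Return the starting index of a substring of t equal to p, or -1.
int kmpMatch(const char *t, const char *p) {
    int n = strlen(t), m = strlen(p);
    if (m == 0) return 0;                // empty pattern matches at 0
    int *F = malloc(m * sizeof(int));
    failureFunction(p, m, F);
    int i = 0, j = 0;                    // start from left
    while (i < n) {
        if (t[i] == p[j]) {
            if (j == m - 1) { free(F); return i - j; }  // match at i-j
            i++; j++;                    // keep comparing
        } else if (j > 0) {
            j = F[j - 1];                // shift pattern; i stays put
        } else {
            i++;                         // begin at next text character
        }
    }
    free(F);
    return -1;                           // no match
}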
Exercise #5: Compute the failure function for P = abaaba (cf. the example above):
⇒ F[0]=0
j=1, len=0, P[1]≠P[0] ⇒ F[1]=0
j=2, len=0, P[2]=P[0] ⇒ len=1, F[2]=1
j=3, len=1, P[3]≠P[1] ⇒ len=F[0]=0
j=3, len=0, P[3]=P[0] ⇒ len=1, F[3]=1
j=4, len=1, P[4]=P[1] ⇒ len=2, F[4]=2
j=5, len=2, P[5]=P[2] ⇒ len=3, F[5]=3
Boyer-Moore vs KMP
Boyer-Moore algorithm
• decides how far to jump ahead based on the mismatched character in the text
• works best on large alphabets and natural-language texts (e.g. English)
Knuth-Morris-Pratt algorithm
• uses information embodied in the pattern itself to determine where the next match could begin
• works best on small alphabets (e.g. {A,C,G,T})
For the keen: the article "Average running time of the Boyer-Moore-Horspool algorithm" shows that the expected running time is inversely proportional to the size of the alphabet
Preprocessing Strings
Preprocessing the pattern speeds up pattern matching queries
• after preprocessing P, the KMP algorithm performs pattern matching in time proportional to the text length
If the text is large, immutable (i.e. it does not change) and searched often (e.g. the works of Shakespeare)
• we can preprocess the text instead of the pattern
A trie is a compact, tree-based data structure for storing a set of strings, supporting fast pattern matching
Note: "trie" comes from retrieval, but is pronounced like "try" to distinguish it from "tree"
Tries
Tries are trees organised using parts of keys (rather than whole keys)
Exercise: how many words are encoded in the trie on the previous slide (figure omitted)? The answer depends on the number of "finishing" (red) nodes; here it is 11.
A common implementation indexes each node's children by character code:

#define ALPHABET_SIZE 26

Note: can also use BST-like nodes for a more space-efficient implementation of tries
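In C, a node of this character-indexed variant could look as follows (a minimal sketch; the finish flag and data pointer mirror the pseudocode operations below):

#include <stdbool.h>

typedef struct TrieNode {
    struct TrieNode *child[ALPHABET_SIZE];  // one slot per character 'a'..'z'
    bool  finish;                           // true if a key ends at this node
    void *data;                             // item stored at a "finishing" node
} TrieNode;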
Trie Operations
Basic operations on tries:
find(trie,key):
| Input  trie, key
| Output pointer to element in trie if key found
|        NULL otherwise
|
| node=trie
| for each char in key do
| |  if node.child[char] exists then
| |  |  node=node.child[char]        // move down one level
| |  else
| |  |  return NULL
| |  end if
| end for
| if node.finish then                // "finishing" node reached?
| |  return node
| else
| |  return NULL
| end if
insert(trie,item,key):
| Input  trie, item with key of length m
| Output trie with item inserted
|
| if trie is empty then
| |  t=new trie node
| else
| |  t=trie
| end if
| if m=0 then
| |  t.finish=true, t.data=item
| else
| |  t.child[key[0]]=insert(t.child[key[0]],item,key[1..m-1])
| end if
| return t
• O(n) space
• insertion and search in O(m) time
  – n … total size of text (e.g. sum of lengths of all strings in a given dictionary)
  – m … size of the string parameter of the operation (the "key")
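The two operations translate directly into C, using the TrieNode struct sketched earlier and assuming keys consist of lower-case letters only:

#include <stdlib.h>

// Return the node holding key, or NULL if key is not in the trie.
TrieNode *find(TrieNode *trie, const char *key) {
    TrieNode *node = trie;
    for ( ; *key != '\0' && node != NULL; key++)
        node = node->child[*key - 'a'];      // move down one level
    return (node != NULL && node->finish) ? node : NULL;
}

// Insert item under key; returns the root of the updated trie.
TrieNode *insert(TrieNode *t, void *item, const char *key) {
    if (t == NULL)
        t = calloc(1, sizeof(TrieNode));     // create missing node on demand
    if (*key == '\0') {                      // end of key: "finishing" node
        t->finish = true;
        t->data = item;
    } else {
        t->child[key[0] - 'a'] =
            insert(t->child[key[0] - 'a'], item, key + 1);
    }
    return t;
}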
Word Matching with Tries
Preprocessing the text:
• insert all searchable words of the text into a trie
• each leaf stores the occurrence(s) of the corresponding word in the text
(figure omitted)
Compressed Tries
Compressed tries are obtained from tries by merging chains of redundant (single-child) internal nodes, labelling each node with a substring rather than a single character
Example: (figure omitted)

Exercise #8: Compressed Tries
How many nodes (including the root) are needed for the compressed trie?
Pattern Matching With Suffix Tries
The suffix trie of a text T is the compressed trie of all the suffixes of T
Example: (figure omitted)
Compact representation: edge labels are stored as pairs (start,end) of indices into T, rather than as explicit substrings
Input: compact suffix trie of a text T and a pattern P of length m
Goal: find the starting index of a substring of T equal to P
suffixTrieMatch(trie,P):
| Input  compact suffix trie for text T, pattern P of length m
| Output starting index of a substring of T equal to P
|        -1 if no such substring exists
|
| j=0, v=root of trie
| repeat
| |  // we have matched j characters
| |  if ∃w∈children(v) such that P[j]=T[start(w)] then
| |  |  i=start(w)                   // start(w) is the start index of w's label
| |  |  x=end(w)-i+1                 // x … length of w's label
| |  |  if m≤x then                  // remaining pattern fits within node label?
| |  |  |  if P[j..j+m-1]=T[i..i+m-1] then
| |  |  |  |  return i-j             // match at i-j
| |  |  |  else
| |  |  |  |  return -1              // no match
| |  |  |  end if
| |  |  else if P[j..j+x-1]=T[i..i+x-1] then
| |  |  |  j=j+x, m=m-x              // update remaining pattern start and length
| |  |  |  v=w                       // move down one level
| |  |  else
| |  |  |  return -1                 // no match
| |  |  end if
| |  else
| |  |  return -1                    // no match
| |  end if
| until v is leaf node
| return -1                          // no match
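A C sketch of the matching phase, mirroring the pseudocode (building the compact suffix trie is not shown; the node layout, with start/end label indices and children indexed by the first character of their label, is an assumption):

#include <string.h>

#define ASCII 128                        // assumed alphabet size

typedef struct STNode {
    int start, end;                      // label = T[start..end]
    struct STNode *child[ASCII];         // indexed by first character of child label
} STNode;

// Return the starting index of a substring of t equal to p, or -1.
int suffixTrieMatch(STNode *root, const char *t, const char *p) {
    int j = 0, m = strlen(p);
    if (m == 0) return 0;                // empty pattern matches at 0
    STNode *v = root;
    for (;;) {                           // until v is a leaf
        STNode *w = v->child[(unsigned char)p[j]];  // child with P[j]=T[start(w)]
        if (w == NULL) return -1;        // no match
        int i = w->start;
        int x = w->end - i + 1;          // length of the node label
        if (m <= x)                      // remaining pattern fits within label?
            return strncmp(p + j, t + i, m) == 0 ? i - j : -1;
        if (strncmp(p + j, t + i, x) != 0)
            return -1;                   // no match
        j += x; m -= x;                  // consume the label
        v = w;                           // move down one level
    }
}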
Text Compression
Problem: efficiently encode a given string X by a smaller string Y
Applications: e.g. reducing storage space and transmission time
Prefix code … binary code such that no code word is a prefix of another code word
Encoding tree … binary tree whose leaves are the characters, with each character's code word given by the path from the root (e.g. left edge = 0, right edge = 1)
Goal: given a text T, find a prefix code that yields the shortest encoding of T
Example: the same text encoded with two different prefix codes:

01011011010000101001011011010 vs 001011000100001100101100
Huffman Code
Huffman's algorithm builds an optimal encoding tree bottom-up, by repeatedly joining the two partial trees with the smallest frequencies
Example: abracadabra
HuffmanCode(T):
| Input  string T of size n
| Output optimal encoding tree for T
|
| compute frequency array for T
| Q=new priority queue
| for all characters c do
| |  Tc=new single-node tree storing c
| |  join(Q,Tc) with frequency(c) as key
| end for
| while |Q|≥2 do
| |  f1=Q.minKey(), T1=leave(Q)       // leave removes the item with minimal key
| |  f2=Q.minKey(), T2=leave(Q)
| |  Tm=new tree node with subtrees T1 and T2
| |  join(Q,Tm) with f1+f2 as key
| end while
| return leave(Q)
Exercise: construct a Huffman tree for "a fast runner need never be afraid of the dark"
Analysis:
• O(n+d·log d) time
  – n … length of the input text T
  – d … number of distinct characters in T
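A C sketch of the tree construction. Where the pseudocode uses a priority queue (join/leave), this sketch repeatedly scans an array for the two minimum-key trees, which is O(d²) overall but fine for small alphabets; all names are my own:

#include <stdlib.h>

typedef struct HNode {
    unsigned char c;                   // character (meaningful at leaves)
    int freq;                          // frequency key
    struct HNode *left, *right;        // NULL at leaves
} HNode;

static HNode *newNode(int c, int freq, HNode *l, HNode *r) {
    HNode *n = malloc(sizeof(HNode));
    n->c = (unsigned char)c; n->freq = freq; n->left = l; n->right = r;
    return n;
}

// Build an encoding tree for text; returns NULL for the empty string.
HNode *huffmanTree(const char *text) {
    int freq[256] = {0};                       // frequency array
    for (const unsigned char *p = (const unsigned char *)text; *p; p++)
        freq[*p]++;

    HNode *q[256];                             // "queue" of partial trees
    int size = 0;
    for (int c = 0; c < 256; c++)
        if (freq[c] > 0)
            q[size++] = newNode(c, freq[c], NULL, NULL);

    while (size >= 2) {
        int a = 0, b = 1;                      // indices of two minimum keys
        if (q[b]->freq < q[a]->freq) { a = 1; b = 0; }
        for (int k = 2; k < size; k++) {
            if      (q[k]->freq < q[a]->freq) { b = a; a = k; }
            else if (q[k]->freq < q[b]->freq) { b = k; }
        }
        // join the two minimum trees under a node with key f1+f2
        HNode *t = newNode(0, q[a]->freq + q[b]->freq, q[a], q[b]);
        if (a > b) { int tmp = a; a = b; b = tmp; }
        q[a] = t;                              // put joined tree back
        q[b] = q[size - 1];                    // remove the second tree
        size--;
    }
    return size > 0 ? q[0] : NULL;
}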
Approximation
Approximation for Numerical Problems
Approximation is often used to solve numerical problems by computing a result that is accurate to within some small ε, rather than exactly.
Examples:
• roots of a function f
• length of a curve determined by a function f
• … and many more
bisection(f,x1,x2):
| Input  function f, interval [x1,x2]
| Output x∈[x1,x2] with f(x)≅0
|
| repeat
| |  mid=(x1+x2)/2
| |  if f(x1)*f(mid)<0 then
| |  |  x2=mid                       // root to the left of mid
| |  else
| |  |  x1=mid                       // root to the right of mid
| |  end if
| until f(mid)=0 or x2-x1<ε          // ε … desired accuracy
| return mid

Bisection is guaranteed to converge to a root if f is continuous on [x1,x2] and f(x1) and f(x2) have opposite signs.
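A direct C rendering (a minimal sketch; the function-pointer parameter and the name eps are my own choices):

// Find x in [x1,x2] with f(x) ≈ 0; assumes f is continuous and
// f(x1), f(x2) have opposite signs. eps is the desired accuracy.
double bisection(double (*f)(double), double x1, double x2, double eps) {
    double mid;
    do {
        mid = (x1 + x2) / 2.0;
        if (f(x1) * f(mid) < 0)
            x2 = mid;                    // root to the left of mid
        else
            x1 = mid;                    // root to the right of mid
    } while (f(mid) != 0 && x2 - x1 >= eps);
    return mid;
}

For example, bisection(cos, 0.0, 3.0, 1e-9) approximates π/2, since cos is continuous and changes sign on [0,3] (include math.h for cos).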
Approximating the length of the curve defined by f over [start,end] with a fixed number of steps:

length=0, δ=(end-start)/steps
for each x∈{start+δ, start+2δ, …, end} do
|  length = length + sqrt(δ² + (f(x)-f(x-δ))²)
end for
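The same computation as a C function (a sketch; the parameter steps is the assumed number of straight-line segments):

#include <math.h>

// Approximate the length of the curve y=f(x) over [start,end]
// by summing the lengths of `steps` straight-line segments.
double curveLength(double (*f)(double), double start, double end, int steps) {
    double delta = (end - start) / steps;
    double length = 0.0;
    for (int k = 1; k <= steps; k++) {
        double x = start + k * delta;
        double dy = f(x) - f(x - delta);
        length += sqrt(delta * delta + dy * dy);
    }
    return length;
}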
Approximation for NP-hard Problems
Approximation is often used for NP-hard problems to obtain, in polynomial time, solutions that are provably close to optimal.
Examples: vertex cover (below), and many more
Vertex Cover
Reminder: Graph G = (V,E)
• set of vertices V
• set of edges E
Vertex cover C of G …
• C ⊆ V
• for all edges (u,v) ∈ E, either v ∈ C or u ∈ C (or both)
Applications:
Theorem.
Determining whether a graph has a vertex cover of a given size k is an NP-complete problem.
approxVertexCover(G):
| Input  undirected graph G=(V,E)
| Output vertex cover of G
|
| C=∅
| unusedE=E
| while unusedE≠∅ do
| |  choose any (v,w)∈unusedE
| |  C=C∪{v,w}
| |  unusedE=unusedE\{all edges incident on v or w}
| end while
| return C
Theorem.
The approximation algorithm returns a vertex cover at most twice the size of an optimal cover.
Proof. The edges chosen by the algorithm share no endpoints, so any (optimal) cover must contain at least one endpoint of each chosen edge; the algorithm adds both endpoints, hence the returned cover is at most twice the optimal size.
Cost analysis … each edge is examined at most once, so the algorithm runs in O(V+E) time with a suitable graph representation
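A C sketch over an edge-list representation (names are my own). Processing edges in order makes the edge-removal step implicit: any later edge incident on a chosen vertex is already covered and is skipped:

#include <stdbool.h>
#include <string.h>

typedef struct { int v, w; } Edge;

// 2-approximation for vertex cover: sets inCover[u]=true for chosen
// vertices. inCover must have room for nV flags.
void approxVertexCover(const Edge edges[], int nE, int nV, bool inCover[]) {
    memset(inCover, 0, nV * sizeof(bool));
    for (int i = 0; i < nE; i++) {
        int v = edges[i].v, w = edges[i].w;
        if (!inCover[v] && !inCover[w]) {    // edge still uncovered?
            inCover[v] = true;               // C = C ∪ {v,w}
            inCover[w] = true;
        }
    }
}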
Summary
• Alphabets and words
• Pattern matching
  – Boyer-Moore, Knuth-Morris-Pratt
• Tries
• Text compression
  – Huffman code
• Approximation
  – numerical problems
  – vertex cover
• Suggested reading:
  – tries … Sedgewick, Ch. 15.2
  – approximation … Moffat, Ch. 9.4