
The Zhu-Takaoka Algorithm: Advisor: Prof. R. C. T. Lee Speaker: S. Y. Tang

The document describes the Zhu-Takaoka string matching algorithm, which is a variant of the Boyer-Moore algorithm. It improves the bad character rule of Boyer-Moore by using a 2-substring rule. The algorithm preprocesses the pattern to construct a ztBc table containing the rightmost occurrence of each 2-substring. During matching, it uses the ztBc table or good suffix rules to determine the shift amount. The preprocessing takes O(m + σ^2) time and space, and searching takes O(m*n) time, where m is the pattern length, n is the text length, and σ is the alphabet size.


The Zhu-Takaoka Algorithm

R. F. ZHU, T. TAKAOKA, On improving the average case of the Boyer-Moore string matching algorithm, Journal of Information Processing 10(3):173-177, 1987

Advisor: Prof. R. C. T. Lee
Speaker: S. Y. Tang

 
• The Zhu-Takaoka Algorithm is an algorithm which solves the string matching problem.
• String matching problem:
Input: a text string T of length n and a pattern string P of length m.
Output: all occurrences of P in T.
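As a baseline for the problem statement above, here is a minimal brute-force sketch. The function name `naive_search` is an illustrative assumption and does not come from the slides; it only shows the expected input and output, not the Zhu-Takaoka method.

```python
def naive_search(text: str, pattern: str) -> list[int]:
    """Return every index j such that pattern occurs in text at position j."""
    n, m = len(text), len(pattern)
    # Try every window of length m and compare it with the pattern.
    return [j for j in range(n - m + 1) if text[j:j + m] == pattern]


print(naive_search("ACTGCCTAAGTA", "CTAAG"))  # → [5]
```

This checks all n-m+1 windows; the algorithm in the following slides tries to skip most of them.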

 
• The Zhu-Takaoka Algorithm is a variant of the Boyer and Moore (BM) Algorithm. It improves only the bad character rule of the BM Algorithm.
• Zhu and Takaoka replaced the bad character rule with a 2-substring rule. The good suffix rules are still used.

 
The 2-Substring Rule 
• Consider text=ACTGCCTAAGTA and pattern=CTAAG.

Index    0  1  2  3  4  5  6  7  8  9  10 11
Text     A  C  T  G  C  C  T  A  A  G  T  A
Pattern  C  T  A  A  G

The last 2-substring of the text window is GC. No GC appears in P.

[Figure: the same text shown twice more with the pattern moved to the right.]

 
How can we know whether a specified 2-substring appears in P or not?

 
Whenever a mismatch or a complete match occurs, we select the last 2-substring of the current window of T and search for the rightmost location of this 2-substring in P, if it exists. This is done by constructing a ztBc table.
• Example

Index    0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20 21 22 23
Text     G  C  A  T  C  G  C  A  G  A  G  G  A  T  A  T  A  C  A  G  T  A  C  G
Pattern  G  C  A  G  A  G  A  G
Shift by 5
Pattern                 G  C  A  G  A  G  A  G
Shift by 1
Pattern                    G  C  A  G  A  G  A  G

ztBc[a,b] (rows a, columns b)
a\b  A  C  G  *
A    8  8  2  8
C    5  8  7  8
G    1  6  7  8
*    8  8  7  8

T(CA) = ztBc[C,A] = 5 means that the rightmost CA in P ends 5 positions before the right end of P. Thus we can shift by 5.
T(GA) = ztBc[G,A] = 1 means that the rightmost GA in P ends 1 position before the right end. If GA is the 2-substring to be matched, we shift 1 step.

 
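ztBc[a,b]
The preprocessing phase of the algorithm consists in computing, for each pair of characters (a, b) with a, b ∈ Σ, the rightmost occurrence of ab in x[0..m-2].
For a, b ∈ Σ:

\[
\mathit{ztBc}[a,b] = k \iff
\begin{cases}
k \le m-2 \text{ and } x[m-k-2 \,..\, m-k-1] = ab \text{ and } ab \text{ does not occur in } x[m-k-1 \,..\, m-2], \text{ or} \\
k = m-1 \text{ and } x[0] = b \text{ and } ab \text{ does not occur in } x[0 \,..\, m-2], \text{ or} \\
k = m \text{ and } x[0] \ne b \text{ and } ab \text{ does not occur in } x[0 \,..\, m-2].
\end{cases}
\]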
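The definition above can be checked entry by entry with a small sketch. The function name `zt_bc_entry` is an assumption made for illustration, and the code assumes k ≥ 1 in the first case, since occurrences are only searched in x[0..m-2].

```python
def zt_bc_entry(x: str, a: str, b: str) -> int:
    """ztBc[a,b] for pattern x, read directly off the three-case definition."""
    m = len(x)
    # First case: the smallest k in 1..m-2 with x[m-k-2..m-k-1] == ab,
    # i.e. the rightmost occurrence of ab inside x[0..m-2].
    for k in range(1, m - 1):
        if x[m - k - 2:m - k] == a + b:
            return k
    # Second and third cases: ab does not occur in x[0..m-2].
    return m - 1 if x[0] == b else m


pattern = "GCAGAGAG"
for a in "ACGT":  # T stands in for any character that is not in the pattern
    print(a, [zt_bc_entry(pattern, a, b) for b in "ACGT"])
# A [8, 8, 2, 8]
# C [5, 8, 7, 8]
# G [1, 6, 7, 8]
# T [8, 8, 7, 8]
```

This scans the pattern once per entry, so it is only meant for checking values by hand, not for real preprocessing.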

 
preprocessing phase 
Consider text=ATTGCCTAAGTA and pattern=CTAAG. The alphabet of the pattern is {A, C, G, T}. The sign "*" denotes any character of the text which never appears in the pattern.
First, we fill in every entry with the length m of the pattern.
Example:
a\b  A  C  G  T  *
A    5  5  5  5  5
C    5  5  5  5  5
G    5  5  5  5  5
T    5  5  5  5  5
*    5  5  5  5  5

 
preprocessing phase 
Then, we suppose the last 2-substring ab does not occur in P[0..m-2]. If P[0] = b, we set ztBc[i, b] = m-1 for all i.
Example (P[0] = C, so column b = C becomes m-1 = 4):
a\b  A  C  G  T  *
A    5  4  5  5  5
C    5  4  5  5  5
G    5  4  5  5  5
T    5  4  5  5  5
*    5  4  5  5  5

T: ATTGCCTAAGTA
P: CTAAG
P:     CTAAG


 
preprocessing phase 
Finally, we set ztBc[a,b] = k if k ≤ m-2 and P[m-k-2..m-k-1]=ab and ab does not occur in P[m-k-1..m-2].
Example (CT gives 3, TA gives 2, AA gives 1):
a\b  A  C  G  T  *
A    1  4  5  5  5
C    5  4  5  3  5
G    5  4  5  5  5
T    2  4  5  5  5
*    5  4  5  5  5

P: CTAAG
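The three preprocessing steps can be sketched as follows. The function name `build_zt_bc` and the use of a dictionary keyed by character pairs are assumptions made for illustration; the symbol "*" is passed in as part of the alphabet to stand for any character that is not in the pattern.

```python
def build_zt_bc(pattern: str, alphabet: str) -> dict[tuple[str, str], int]:
    """Fill the ztBc table in the three steps described on the slides."""
    m = len(pattern)
    # Step 1: every entry starts at the pattern length m.
    table = {(a, b): m for a in alphabet for b in alphabet}
    # Step 2: the column b = pattern[0] is set to m - 1 for every row.
    for a in alphabet:
        table[(a, pattern[0])] = m - 1
    # Step 3: scan left to right so that the rightmost occurrence of each
    # 2-substring in pattern[0..m-2] overwrites earlier ones.
    for i in range(1, m - 1):
        table[(pattern[i - 1], pattern[i])] = m - 1 - i
    return table


table = build_zt_bc("CTAAG", "ACGT*")
for a in "ACGT*":
    print(a, [table[(a, b)] for b in "ACGT*"])
# A [1, 4, 5, 5, 5]
# C [5, 4, 5, 3, 5]
# G [5, 4, 5, 5, 5]
# T [2, 4, 5, 5, 5]
# * [5, 4, 5, 5, 5]
```

The table has σ^2 entries and the last loop runs m-2 times, which is where the O(m + σ^2) preprocessing cost comes from.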
 
Case 1 : 
⇒ ztBc[a,b] = k ⇔ k ≤ m-2 and x[m-k-2..m-k-1] = ab and ab does not occur in x[m-k-1..m-2].
• Example

Index    0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20 21 22 23
Text     G  C  A  T  C  G  C  A  G  A  G  A  G  T  A  T  A  C  A  G  T  A  C  G
Pattern  G  C  A  G  A  G  A  G
Shift by 5
Pattern                 G  C  A  G  A  G  A  G

i     0  1  2  3  4  5  6  7
x[i]  G  C  A  G  A  G  A  G

• ztBc[C,A] = 5 ; k ≤ m-2 ; ∵ x[8-5-2..8-5-1] = ab (x[1..2] = CA) and "CA" does not occur in x[8-5-1..8-2] (x[2..6]).
 
Case 2 : 
⇒ ztBc[a,b] = k ⇔ k = m-1 and x[0] = b and ab does not occur in x[0..m-2].
• Example

Index    0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20 21 22 23
Text     G  C  A  T  C  G  C  G  G  A  G  A  G  T  A  T  A  C  A  G  T  A  C  G
Pattern  G  C  A  G  A  G  A  G
Shift by 7
Pattern                       G  C  A  G  A  G  A  G

• ztBc[C,G] = 7 ; k = m-1 ; ∵ x[0] = b (G = G) and "CG" does not occur in x[0..8-2] (x[0..6]).
 
Case 3 : 
⇒ ztBc[a,b] = k ⇔ k = m and x[0] ≠ b and ab does not occur in x[0..m-2].
• Example

Index    0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20 21 22 23
Text     G  C  A  T  C  G  C  A  G  A  G  A  G  T  A  T  A  C  A  G  T  A  C  G

Pattern: G C A G A G A G

• ztBc[A,C] = 8 ; k = m ; ∵ x[0] ≠ b (G ≠ C) and "AC" does not occur in x[0..8-2] (x[0..6]).
 
• Full Example 
Index    0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20 21 22 23
Text     G  C  A  T  C  G  C  A  G  A  G  A  G  T  A  T  A  C  A  G  T  A  C  G
Pattern  G  C  A  G  A  G  A  G
Shift by 5
Pattern                 G  C  A  G  A  G  A  G

In this step, we select the ztBc function to shift because ztBc[C,A] = 5 > bmGs[7] = 1 (the last 2-substring of the window is T[6..7] = CA). The pattern shifts 5 steps right by case 1.

ztBc[a,b] (rows a, columns b)
a\b  A  C  G  *
A    8  8  2  8
C    5  8  7  8
G    1  6  7  8
*    8  8  7  8

i     0  1  2  3  4  5  6  7
x[i]  G  C  A  G  A  G  A  G
bmGs  7  7  7  2  7  4  7  1
 
• Full Example 
Index    0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20 21 22 23
Text     G  C  A  T  C  G  C  A  G  A  G  A  G  T  A  T  A  C  A  G  T  A  C  G
Pattern                 G  C  A  G  A  G  A  G
Exact matching
Shift by 7
Pattern                                      G  C  A  G  A  G  A  G

In this step, we select the bmGs function to shift because ztBc[A,G] = 2 < bmGs[0] = 7.
 
• Full Example 
Index    0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20 21 22 23
Text     G  C  A  T  C  G  C  A  G  A  G  A  G  T  A  T  A  C  A  G  T  A  C  G
Pattern                                      G  C  A  G  A  G  A  G
Shift by 4
Pattern                                                  G  C  A  G  A  G  A  G

In this step, the mismatch occurs at x[5], and we select the bmGs function to shift because ztBc[A,G] = 2 < bmGs[5] = 4.
 
• Full Example 
Index    0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20 21 22 23
Text     G  C  A  T  C  G  C  A  G  A  G  A  G  T  A  T  A  C  A  G  T  A  C  G
Pattern                                                  G  C  A  G  A  G  A  G

In this step, the mismatch occurs at x[6]. We can select either the ztBc function or the bmGs function to shift because ztBc[C,G] = 7 = bmGs[6]. The shift by 7 moves the pattern past the end of the text, so the search ends.
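The four steps of the full example can be reproduced with the sketch below. All names (`good_suffix_table`, `zhu_takaoka_search`) are assumptions made for illustration. The good suffix table is computed with a simple quadratic suffix scan rather than a linear-time method, the pattern is assumed to have length m ≥ 2, and, as on the slides, the shift is the larger of the bmGs and ztBc values after both a mismatch and a complete match.

```python
def good_suffix_table(x: str) -> list[int]:
    """bmGs table of the Boyer-Moore good suffix rules (quadratic sketch)."""
    m = len(x)
    # suff[i]: length of the longest suffix of x[0..i] that is a suffix of x.
    suff = [0] * m
    for i in range(m):
        k = 0
        while k <= i and x[i - k] == x[m - 1 - k]:
            k += 1
        suff[i] = k
    bm_gs = [m] * m
    j = 0
    for i in range(m - 1, -1, -1):
        if suff[i] == i + 1:  # x[0..i] is both a prefix and a suffix of x
            while j < m - 1 - i:
                if bm_gs[j] == m:
                    bm_gs[j] = m - 1 - i
                j += 1
    for i in range(m - 1):
        bm_gs[m - 1 - suff[i]] = m - 1 - i
    return bm_gs


def zhu_takaoka_search(text: str, pattern: str) -> tuple[list[int], list[int]]:
    """Return (match positions, list of shifts taken); assumes len(pattern) >= 2."""
    n, m = len(text), len(pattern)
    # Store only the pairs that occur in pattern[0..m-2]; the rightmost wins.
    pairs = {(pattern[i - 1], pattern[i]): m - 1 - i for i in range(1, m - 1)}
    bm_gs = good_suffix_table(pattern)
    matches, shifts = [], []
    j = 0
    while j <= n - m:
        i = m - 1
        while i >= 0 and pattern[i] == text[j + i]:
            i -= 1
        # ztBc lookup on the last 2-substring of the current window.
        a, b = text[j + m - 2], text[j + m - 1]
        zt = pairs.get((a, b), m - 1 if b == pattern[0] else m)
        if i < 0:
            matches.append(j)
            gs = bm_gs[0]
        else:
            gs = bm_gs[i]
        shifts.append(max(gs, zt))
        j += shifts[-1]
    return matches, shifts


print(good_suffix_table("GCAGAGAG"))
# → [7, 7, 7, 2, 7, 4, 7, 1]
print(zhu_takaoka_search("GCATCGCAGAGAGTATACAGTACG", "GCAGAGAG"))
# → ([5], [5, 7, 4, 7])
```

The returned shifts 5, 7, 4, 7 are the ones shown in the four steps, and the single match is at position 5.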
 
Time complexity 
• The preprocessing phase takes O(m + σ^2) time and space (σ = the size of the alphabet of the text).
• The searching phase takes O(mn) time.
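To see that the O(mn) bound for the searching phase can actually be reached, consider the following worked case. The choice of pattern and text is an illustrative assumption, not an example from the slides.

```latex
% Pattern x = a^m and text y = a^n, with 2 \le m \le n.
% Every window matches, so each attempt costs m comparisons.
% After a match the shift is \max(bmGs[0], ztBc[a,a]) = \max(1, 1) = 1,
% so all n - m + 1 windows are examined.
\[
\text{comparisons} \;=\; m\,(n - m + 1) \;=\; \Theta(mn)
\quad \text{whenever } m \le n/2 .
\]
```

On typical texts the shifts are much longer than 1, which is the average-case improvement the algorithm aims for.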
 
References 
1. ZHU, R.F. and TAKAOKA, T., 1987, On improving the average case of the Boyer-Moore string matching algorithm, Journal of Information Processing 10(3):173-177.
 
Thank you for your attention.
