Suffix Tree and Suffix Array - Fin5
Exact Matching Problem
Find ‘ssi’ in ‘mississippi’
[Figure: suffix tree of ‘mississippi’; surviving edge labels: s, si, ssippi]
Every leaf below this point in the tree marks a starting
location of ‘ssi’ in ‘mississippi’
(i.e. the suffixes ‘ssissippi’ and ‘ssippi’).
Exact Matching Problem
Find ‘sissy’ in ‘mississippi’
[Figure, shown over several frames: the search path in the suffix tree; surviving edge labels: s, i, ss]
The walk from the root matches ‘siss’ and then fails, because no
edge continues with ‘y’. So ‘sissy’ does not occur in ‘mississippi’.
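The two searches above can be sketched in a few lines. This is a minimal illustration, not the compressed suffix tree from the slides: it uses an uncompressed suffix trie built in O(n^2) time, and the names build_suffix_trie, find_all and the "#leaf" marker are assumptions made for this example.

```python
def build_suffix_trie(text):
    """Naive O(n^2) suffix trie: nested dicts, one root-to-leaf path per suffix."""
    text += "$"  # terminator (assumed absent from text) so each suffix ends at its own leaf
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["#leaf"] = start  # multi-character key, cannot clash with a single character
    return root


def find_all(root, pattern):
    """Walk the pattern down from the root, then collect every leaf below that point."""
    node = root
    for ch in pattern:
        if ch not in node:
            return []  # mismatch: the pattern does not occur
        node = node[ch]
    positions, stack = [], [node]
    while stack:
        cur = stack.pop()
        for key, child in cur.items():
            if key == "#leaf":
                positions.append(child)
            else:
                stack.append(child)
    return sorted(positions)


trie = build_suffix_trie("mississippi")
print(find_all(trie, "ssi"))    # [2, 5]
print(find_all(trie, "sissy"))  # []
```

Positions 2 and 5 are where the suffixes ssissippi and ssippi begin (counting from 0).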
Exact Matching Problem
So what? Knuth-Morris-Pratt and Boyer-Moore
both achieve the same worst-case bound:
O(m+n) when the text (length m) and the pattern (length n)
are presented together.
Suffix trees are much faster when the text is
fixed and known first while the patterns vary:
O(m) to process the text once, then only O(n)
for each new pattern.
Aho-Corasick is faster for searching a number
of patterns at one time against a single text.
Boyer-Moore Algorithm
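For string matching (the exact matching problem)
Time complexity: O(m+n) in the worst case
and O(n/m) in the best case, e.g. when the characters of the text
are absent from the pattern (on this slide n is the text length
and m is the pattern length)
Method: backward matching with 2
jumping arrays (bad character table and
good suffix table)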
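A hedged sketch of the backward matching idea follows. It is deliberately simplified: only the bad character table is used and the good suffix table is omitted, so it does not have the O(m+n) worst case quoted above. The function names are assumptions made for this example.

```python
def bad_character_table(pattern):
    """Map each character to the index of its rightmost occurrence in the pattern."""
    return {ch: i for i, ch in enumerate(pattern)}


def boyer_moore_bad_char(text, pattern):
    """Return all match positions using right-to-left comparison and
    the bad character rule only (no good suffix table)."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    last = bad_character_table(pattern)
    matches = []
    shift = 0  # current alignment of the pattern against the text
    while shift <= n - m:
        j = m - 1
        while j >= 0 and pattern[j] == text[shift + j]:
            j -= 1  # compare backward, from the end of the pattern
        if j < 0:
            matches.append(shift)
            shift += 1  # conservative shift after a full match
        else:
            # Line up the mismatched text character with its rightmost
            # occurrence in the pattern; always move at least one step.
            shift += max(1, j - last.get(text[shift + j], -1))
    return matches


print(boyer_moore_bad_char("mississippi", "ssi"))  # [2, 5]
```

When the mismatched text character does not occur in the pattern at all, the pattern jumps past it entirely, which is where the O(n/m) best case comes from.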
What are suffix arrays and trees?
• Text indexing data structures
• not word based
• allow search for patterns or computation of statistics
Important Properties
• Size
• Speed of exact matching
• Space required for construction
• Time required for construction
Suffix Tree
Properties of a Suffix Tree
Each tree edge is labeled by a substring of S.
Each internal node has at least 2 children.
Each suffix S(i) has a corresponding labeled path from the root to a leaf, for 1 ≤ i ≤ n.
There are n leaves.
No two edges branching out from the same internal node can start with the same character.
Building the Suffix Tree
How do we build a suffix tree?
while suffixes remain:
add next shortest suffix to the tree
Building the Suffix Tree
[Figure, built up over several frames: the suffix tree of ‘papua’ grows as each suffix is added. Edge labels in the final frame: p → apua, ua; a → pua; ua]
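The loop "while suffixes remain: add the next suffix" can be sketched as follows. This is the naive O(n^2) insertion, not a linear-time construction; the Node class, the "$" terminator and the helper names are assumptions made for this example. Suffixes are inserted from longest to shortest, as in the papua frames.

```python
class Node:
    def __init__(self):
        self.children = {}        # first character -> (edge label, child Node)
        self.suffix_start = None  # set on leaves only


def build_suffix_tree(text):
    """Naive construction: insert one suffix at a time, splitting edges as needed."""
    text += "$"  # terminator, assumed absent from text, so no suffix is a prefix of another
    root = Node()
    for start in range(len(text)):
        node, rest = root, text[start:]
        while True:
            first = rest[0]
            if first not in node.children:
                leaf = Node()
                leaf.suffix_start = start
                node.children[first] = (rest, leaf)  # new leaf edge
                break
            label, child = node.children[first]
            k = 0  # length of the common prefix of the edge label and rest
            while k < len(label) and k < len(rest) and label[k] == rest[k]:
                k += 1
            if k == len(label):
                node, rest = child, rest[k:]  # edge fully matched, descend
                continue
            mid = Node()  # mismatch inside the edge: split it at offset k
            mid.children[label[k]] = (label[k:], child)
            node.children[first] = (label[:k], mid)
            node, rest = mid, rest[k:]


    return root


def count_leaves(node):
    if not node.children:
        return 1
    return sum(count_leaves(child) for _, child in node.children.values())


tree = build_suffix_tree("papua")
print(count_leaves(tree))  # 6: the five suffixes of papua plus the terminator alone
print(sorted(label for label, _ in tree.children["p"][1].children.values()))  # ['apua$', 'ua$']
```

Because of the terminator there is one more leaf than the n leaves stated in the properties; apart from the trailing "$", the edges below p match the figure (apua and ua).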
    0 1 2 3 4 5 6 7 8 9 10
s = m i s s i s s i p p i
Step 1: SA≠0 = sort the suffixes starting at positions
i ≠ 0 mod 3.
Pad s with $ $ and radix sort the triples s[i..i+2] for i ≠ 0 mod 3,
naming each triple by its rank:
position: 1   4   7   10  2   5   8
triple:   iss iss ipp i$$ ssi ssi ppi
rank:     3   3   2   1   5   5   4
Let S12 = [ 3 3 2 1 5 5 4 ]
=> SA≠0 = [ 10 7 4 1 8 5 2 ], obtained recursively from S12 in time T(2n/3)
Suffixes of S12 and the corresponding suffixes of s:
position:  1 4 7 10 2 5 8
S12    = [ 3 3 2 1 5 5 4 ]     s    = m i s s i s s i p p i
S12_1  = [ 3 3 2 1 5 5 4 ]     S_1  = i s s i s s i p p i
S12_4  = [ 3 2 1 5 5 4 ]       S_4  = i s s i p p i
S12_7  = [ 2 1 5 5 4 ]         S_7  = i p p i
S12_10 = [ 1 5 5 4 ]           S_10 = i
S12_2  = [ 5 5 4 ]             S_2  = s s i s s i p p i
S12_5  = [ 5 4 ]               S_5  = s s i p p i
S12_8  = [ 4 ]                 S_8  = p p i
SA≠0 = [ 10 7 4 1 8 5 2 ]
It suffices to show that S12_i < S12_j <=> S_i < S_j.
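Step 1 and the correspondence above can be checked on the example with a short sketch. Two assumptions are made here: Python's sorted() stands in for the radix sort, and the suffixes of S12 are sorted naively instead of by a recursive call. The function name rank_triples is hypothetical.

```python
def rank_triples(s):
    """Name the triples s[i:i+3] at positions i % 3 != 0 by their sorted rank."""
    n = len(s)
    padded = s + "$$"  # assumes "$" sorts below every character of s
    positions = ([i for i in range(n) if i % 3 == 1]
                 + [i for i in range(n) if i % 3 == 2])
    triples = [padded[i:i + 3] for i in positions]
    # Equal triples get equal names; names start at 1 as on the slide.
    names = {t: r for r, t in enumerate(sorted(set(triples)), start=1)}
    return positions, [names[t] for t in triples]


positions, s12 = rank_triples("mississippi")
print(positions)  # [1, 4, 7, 10, 2, 5, 8]
print(s12)        # [3, 3, 2, 1, 5, 5, 4]

# Sorting the suffixes of S12 (naively here) and mapping back to text positions.
sa12 = sorted(range(len(s12)), key=lambda k: s12[k:])
sa_nonzero = [positions[k] for k in sa12]
print(sa_nonzero)  # [10, 7, 4, 1, 8, 5, 2]
```

For this example the order of the S12 suffixes is the same as the order of the corresponding suffixes of s, which is the claim the slides go on to justify.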
Compare S_i and S_j where i = 0 mod 3 and j ≠ 0 mod 3:
case 1: j = 1 mod 3
∵ i+1 = 1 mod 3 and j+1 = 2 mod 3, so S_{i+1} and S_{j+1} are both already ranked in SA≠0
∴ compare (s[i], S_{i+1}) with (s[j], S_{j+1})
in constant time.
case 2: j = 2 mod 3
∵ i+2 = 2 mod 3 and j+2 = 1 mod 3, so S_{i+2} and S_{j+2} are both already ranked in SA≠0
∴ compare (s[i], s[i+1], S_{i+2}) with
(s[j], s[j+1], S_{j+2}) in constant time.
S12_i < S12_j <=> S_i < S_j
position: 1 4 7 10 2 5 8              0 1 2 3 4 5 6 7 8 9 10 11 12
S12   = [ 3 3 2 1 5 5 4 ]       s   = m i s s i s s i p p i  $  $
Case 1: i = j mod 3
Ex:
S12_4 = [ 3 2 1 5 5 4 ]         S_4 = [ i s s i p p i $ $ ]
S12_1 = [ 3 3 2 1 5 5 4 ]       S_1 = [ i s s i s s i p p i $ $ ]
Case 2: i ≠ j mod 3
Ex:
S12_4 = [ 3 2 1 5 5 4 ]         S_4 = [ i s s i p p i $ $ ]
S12_5 = [ 5 4 ]                 S_5 = [ s s i p p i ]
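    0 1 2 3 4 5 6 7 8 9 10
s = m i s s i s s i p p i
Step 2: SA=0 = sort the suffixes starting at positions i = 0 mod 3
by radix sort on the pairs (s[i], S_{i+1}).
Pairs (s[i], S_{i+1})     Sorted by S_{i+1}        Then sorted by s[i]
0: (m, ississippi)        9: (p, i)                0: (m, ississippi)
3: (s, issippi)           6: (s, ippi)             9: (p, i)
6: (s, ippi)              3: (s, issippi)          6: (s, ippi)
9: (p, i)                 0: (m, ississippi)       3: (s, issippi)
=> SA=0 = [ 0 9 6 3 ]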
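A small sketch of this step, assuming SA≠0 is already known. Python's sorted() on the pair (s[i], rank of S_{i+1}) stands in for the two radix sort passes; the function name sort_mod0 is hypothetical.

```python
def sort_mod0(s, sa_nonzero):
    """Sort the suffixes at positions i % 3 == 0 by the pair (s[i], rank of S_{i+1})."""
    n = len(s)
    # Rank of each suffix S_j with j % 3 != 0, read off from SA != 0.
    rank = {pos: r for r, pos in enumerate(sa_nonzero, start=1)}
    rank[n] = 0  # the empty suffix past the end sorts before everything
    mod0 = [i for i in range(n) if i % 3 == 0]
    return sorted(mod0, key=lambda i: (s[i], rank[i + 1]))


print(sort_mod0("mississippi", [10, 7, 4, 1, 8, 5, 2]))  # [0, 9, 6, 3]
```

The rank lookup is what makes each comparison constant time: the long second component of the pair is never compared character by character.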
Step 3: SA = merge SA=0 and SA≠0.
SA = [ 10 7 4 1 0 9 8 6 3 5 2 ]
The merge takes O(n) time if we can determine the relative
order of S_i ∈ SA=0 and S_j ∈ SA≠0 in constant time.
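The merge can be sketched as an ordinary two-way merge whose comparison uses the pair/triple rule from the case analysis (j = 1 mod 3 or j = 2 mod 3). The function names and the convention that positions past the end get rank 0 are assumptions made for this example.

```python
def merge_suffix_arrays(s, sa_zero, sa_nonzero):
    """Merge SA=0 and SA!=0 using constant-time comparisons."""
    n = len(s)
    rank = {pos: r for r, pos in enumerate(sa_nonzero, start=1)}
    rank[n] = rank[n + 1] = 0  # positions past the end rank below every real suffix

    def ch(i):
        return s[i] if i < n else ""  # "" sorts below any character

    def zero_first(i, j):
        """True if S_i (i % 3 == 0) sorts before S_j (j % 3 != 0)."""
        if j % 3 == 1:
            return (ch(i), rank[i + 1]) < (ch(j), rank[j + 1])
        return (ch(i), ch(i + 1), rank[i + 2]) < (ch(j), ch(j + 1), rank[j + 2])

    merged, a, b = [], 0, 0
    while a < len(sa_zero) and b < len(sa_nonzero):
        if zero_first(sa_zero[a], sa_nonzero[b]):
            merged.append(sa_zero[a])
            a += 1
        else:
            merged.append(sa_nonzero[b])
            b += 1
    return merged + sa_zero[a:] + sa_nonzero[b:]


print(merge_suffix_arrays("mississippi", [0, 9, 6, 3], [10, 7, 4, 1, 8, 5, 2]))
# [10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2]
```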
Time complexity analysis
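A sketch of the analysis, assuming each of the three steps costs at most c·n outside the recursive call on the 2n/3 suffixes with i ≠ 0 mod 3 (the constant c is introduced here for illustration):

```latex
T(n) = T\!\left(\tfrac{2n}{3}\right) + O(n)
\;\le\; c\,n \sum_{k \ge 0} \left(\tfrac{2}{3}\right)^{k}
\;=\; 3\,c\,n \;=\; O(n)
```

The geometric series converges because each recursive call works on a string two thirds as long as the previous one.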
Search for the leftmost occurrence of BB:
text:  A B A A B B A B B A C
SA:    2 0 3 6 9 1 5 8 4 7 10
index: 0 1 2 3 4 5 6 7 8 9 10
Successive comparisons of BB against the probed suffixes:
BB > BA
BB = BB
BB > BA
BB > BA
BB < C
The leftmost position of BB is pointed to by SA[8].
The rightmost position of BB is pointed to by SA[9].
=> All occurrences of the pattern BB are pointed to by SA[8..9]
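The search can be sketched with two binary searches over the suffix array. The suffix array is built naively by sorting, which is fine for a short example; the function names are assumptions, and the sequence of probes need not match the frames on the slides.

```python
def suffix_array(text):
    """Naive construction by sorting all suffixes."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_range(text, sa, pattern):
    """Return (lo, hi) such that sa[lo:hi] lists every occurrence of pattern."""
    m = len(pattern)
    # Leftmost: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    left = lo
    # Rightmost: first suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return left, lo


text = "ABAABBABBAC"
sa = suffix_array(text)
left, right = find_range(text, sa, "BB")
print(sa)                      # [2, 0, 3, 6, 9, 1, 5, 8, 4, 7, 10]
print(left, right - 1)         # 8 9
print(sorted(sa[left:right]))  # [4, 7]
```

Each probe compares at most len(pattern) characters, so a search costs O(n log m) for a pattern of length n in a text of length m.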
Important Properties
lcp(2, 3) = 4
Example
Haim Kaplan's home page: http://www.math.tau.ac.il/~haimk/
Let S = mississippi
Let P = issa
      SA  suffix
L ->  10  i
       7  ippi
       4  issippi
       1  ississippi
       0  mississippi
M ->   9  pi
       8  ppi
       6  sippi
       3  sissippi
       5  ssippi
R ->   2  ssissippi
How do we accelerate the search?
Maintain l = lcp(P, L)
Maintain r = lcp(P, R)
If l > r then
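One simple way to use l and r is sketched below. It is not necessarily the case analysis the slides go on to give: it only relies on the fact that every suffix between L and R agrees with P on the first min(l, r) characters, so each comparison can start from that offset. All function names are assumptions made for this example.

```python
def extend_lcp(text, start, pattern, known):
    """Grow the match between pattern and the suffix text[start:],
    skipping the first `known` characters that are already known to agree."""
    k = known
    while k < len(pattern) and start + k < len(text) and text[start + k] == pattern[k]:
        k += 1
    return k


def precedes(text, start, pattern, k):
    """Given lcp k < len(pattern): does the suffix text[start:] sort before pattern?"""
    return start + k == len(text) or text[start + k] < pattern[k]


def search(text, sa, pattern):
    """Return some text position where pattern occurs, or -1 if it does not."""
    m = len(pattern)
    if m == 0 or not sa:
        return -1
    L, R = 0, len(sa) - 1
    l = extend_lcp(text, sa[L], pattern, 0)  # l = lcp(P, L)
    r = extend_lcp(text, sa[R], pattern, 0)  # r = lcp(P, R)
    if l == m:
        return sa[L]
    if r == m:
        return sa[R]
    if not precedes(text, sa[L], pattern, l) or precedes(text, sa[R], pattern, r):
        return -1  # pattern sorts outside the whole array
    while R - L > 1:
        M = (L + R) // 2
        k = extend_lcp(text, sa[M], pattern, min(l, r))  # skip the shared prefix
        if k == m:
            return sa[M]
        if precedes(text, sa[M], pattern, k):
            L, l = M, k
        else:
            R, r = M, k
    return -1


text = "mississippi"
sa = sorted(range(len(text)), key=lambda i: text[i:])  # naive suffix array
print(search(text, sa, "issa"))  # -1
print(search(text, sa, "sip"))   # 6
```

For P = issa the search starts with l = 1 (against the suffix i) and r = 0 (against ssissippi), as in the table above.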
S12   = [ 3 3 2 1 5 5 4 ]
SA12  = [ 3 2 1 0 6 5 4 ]
LCP12 = [ 0 0 1 0 0 1 0 ]
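These arrays can be reproduced with a naive sketch. It assumes LCP12[i] is the length of the longest common prefix of the suffixes at SA12[i] and SA12[i+1], with a trailing 0, which is the convention that matches the values shown. The function name lcp_array is hypothetical, and linear-time LCP construction is not attempted.

```python
def lcp_array(seq):
    """Return (suffix array, LCP array) of seq, both computed naively."""
    sa = sorted(range(len(seq)), key=lambda i: seq[i:])

    def lcp(a, b):
        k = 0
        while a + k < len(seq) and b + k < len(seq) and seq[a + k] == seq[b + k]:
            k += 1
        return k

    # Entry i compares the suffixes at sa[i] and sa[i + 1]; the last entry is 0.
    lcps = [lcp(sa[i], sa[i + 1]) for i in range(len(sa) - 1)] + [0]
    return sa, lcps


sa12, lcp12 = lcp_array([3, 3, 2, 1, 5, 5, 4])
print(sa12)   # [3, 2, 1, 0, 6, 5, 4]
print(lcp12)  # [0, 0, 1, 0, 0, 1, 0]
```

Applied to mississippi, the entry at index 2 is 4 (issippi and ississippi share issi), which agrees with lcp(2, 3) = 4 if that notation refers to suffix array positions 2 and 3.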
Advantage
They efficiently answer more complex queries.
Drawback
Costly construction process
The text must be readily available at query time
The results are not delivered in text position order.
NLP Laboratory of Hanshin University http://infocom.chonan.ac.kr/~limhs/
Compression
Suffix trees can be compressed almost to the size of suffix arrays.
Suffix arrays can’t be compressed (they are almost random), but can be
constructed over compressed text:
instead of Huffman, use a code that respects alphabetic order
almost the same compression
Signature files are sparse, so they can be compressed, with ratios up to 70%.
Compression
Suffix trees and suffix arrays
Suffix arrays are very hard to compress further,
because they represent an almost perfectly random
permutation of the pointers to the text.
Suffix arrays on compressed text
The main advantage is that both index construction and
querying almost double their performance.
• Construction is faster because more compressed text fits in the
same memory space and therefore fewer text blocks are needed.
• Searching is faster because a large part of the search time is
spent in disk seek operations over the text area to compare
suffixes.
Where have suffix trees been used?
Problems
linear-time longest common substring
constant-time least common ancestor
compression
inexact matching
Applications
The Human Genome Project (see Skiena)
motif discovery (see Arabidopsis genome
project)
PST – probabilistic suffix trees