
The document discusses suffix trees and suffix arrays. It provides motivation for using these data structures for efficient text searching. It then describes the exact string matching problem and illustrates how suffix trees can solve this problem in linear time. Finally, it covers how to build a suffix tree in linear time using Ukkonen's algorithm and how to build a suffix array, including using a suffix tree and direct construction approaches.


Suffix Tree and Suffix Array

R92922025 Brain Chen

R92548028 Pluto Chang

Outline
 Motivation
 Exact Matching Problem
 Suffix Tree
 Building issues
 Suffix Array
 Build
 Search
 Longest common prefixes
 Extra topics discussion
 Suffix Tree VS. Suffix Array
Motivation
• Text search
• Need a fast searching algorithm (with low space cost)
• DNA sequences and protein sequences are too large to search by traditional algorithms
• Some improved algorithms perform efficiently
• KMP, BM algorithms for string matching
• Suffix Tree with linear construction and searching time
• Suffix Array with Suffix Tree based construction
Exact Matching Problem
 Find ‘ssi’ in ‘mississippi’

poulin at cs_ualberta_ca http://www.cs.ualberta.ca/~poulin/

[Figure: suffix tree of ‘mississippi’ with the edges s, si and ssippi highlighted while matching ‘ssi’ from the root.]
Every leaf below this point in the tree marks the starting location of an occurrence of ‘ssi’ in ‘mississippi’ (i.e. the suffixes ‘ssissippi’ and ‘ssippi’).
Exact Matching Problem
 Find ‘sissy’ in ‘mississippi’
[Figure: suffix tree of ‘mississippi’ with the edges s, i and ss traced while trying to match ‘sissy’ from the root.]
Exact Matching Problem
 So what? Knuth-Morris-Pratt and Boyer-Moore both achieve this worst-case bound.
 O(m+n) when the text (length m) and the pattern (length n) are presented together.
 Suffix trees are much faster when the text is fixed and known first while the patterns vary.
 O(m) to preprocess the text once, then only O(n) for each new pattern.
 Aho-Corasick is faster for searching a number of patterns at one time against a single text.
Boyer-Moore Algorithm
 For string matching (exact matching problem)
 Time complexity O(m+n) in the worst case and about O(n/m) in the best case, e.g. when the pattern is absent (reading n as the text length and m as the pattern length here)
 Method: backward matching with 2 jump tables (bad character table and good suffix table)
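The role of the bad character table can be illustrated with the simpler Boyer-Moore-Horspool variant, which keeps only that one table and drops the good suffix table. This is a minimal sketch of that variant, not the full Boyer-Moore algorithm; the function name is illustrative.

```python
def horspool_search(text, pattern):
    """Return all start positions of pattern in text (Horspool variant)."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    # Bad character table: distance from the last occurrence of each character
    # (ignoring the final pattern position) to the end of the pattern.
    shift = {c: m - 1 - i for i, c in enumerate(pattern[:-1])}
    hits = []
    pos = 0
    while pos + m <= n:
        k = m - 1
        # Backward matching: compare from the right end of the window.
        while k >= 0 and text[pos + k] == pattern[k]:
            k -= 1
        if k < 0:
            hits.append(pos)
        # Jump according to the text character under the window's last cell.
        pos += shift.get(text[pos + m - 1], m)
    return hits

print(horspool_search("mississippi", "ssi"))    # [2, 5]
print(horspool_search("mississippi", "sissy"))  # []
```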
What are suffix arrays and trees?
• Text indexing data structures
• not word based
• allow search for patterns or
• computation of statistics

Important Properties
• Size
• Speed of exact matching
• Space required for construction
• Time required for construction
Suffix Tree
Properties of a Suffix Tree
 Each tree edge is labeled by a substring of S.
 Each internal node has at least 2 children.
 Each suffix S(i) has its corresponding labeled path from the root to a leaf, for 1 ≤ i ≤ n.
 There are n leaves.
 No two edges branching out from the same internal node can start with the same character.
Building the Suffix Tree
 How do we build a suffix tree?
while suffixes remain:
add next shortest suffix to the tree
Building the Suffix Tree
 papua
[Figure: step-by-step construction of the suffix tree for ‘papua’ as the suffixes papua, apua, pua, ua and a are added; along the way the edges are split into p + apua / ua and a + pua.]
Building the Suffix Tree
 Adding the suffixes one at a time like this is the naïve method: O(m^2) time (m = text size)
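The naïve construction and the search from the exact matching slides can be sketched as follows. This is an illustration under simplifying assumptions: it builds an uncompressed suffix trie (one character per edge) rather than a true suffix tree with substring edge labels, and every node stores the start positions of the suffixes passing through it instead of relying on leaves. The names `TrieNode`, `build_suffix_trie` and `find_all` are made up for this example.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # character -> TrieNode
        self.starts = []     # start positions of all suffixes passing through here

def build_suffix_trie(text):
    """Insert every suffix character by character: O(m^2) time and space."""
    root = TrieNode()
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.children.setdefault(ch, TrieNode())
            node.starts.append(start)
    return root

def find_all(root, pattern):
    """Walk down the pattern; the node reached knows every occurrence."""
    node = root
    for ch in pattern:
        node = node.children.get(ch)
        if node is None:
            return []        # fell off the trie: no occurrence
    return node.starts

trie = build_suffix_trie("mississippi")
print(find_all(trie, "ssi"))    # [2, 5]  (0-based starts of ssissippi and ssippi)
print(find_all(trie, "sissy"))  # []
```

After the one-time construction, each query costs time proportional to the pattern length, which is the behaviour the slides attribute to suffix trees.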

Building the Suffix Tree in O(m) Time
 In the previous example, we assumed that the tree can be built in O(m) time.
 Weiner gave the original O(m) algorithm in 1973 (Knuth is claimed to have called it “the algorithm of 1973”)
 More space-efficient algorithm by McCreight in 1976
 Simpler ‘on-line’ algorithm by Ukkonen in 1995
Ukkonen’s Algorithm
 Build suffix tree T for string S[1..m]
 Build the tree in m phases, one for each
character. At the end of phase i, we will have
tree Ti, which is the tree representing the prefix
S[1..i].
 In each phase i, we have i extensions, one for each
character in the current prefix. At the end of
extension j, we will have ensured that S[j..i] is in the
tree Ti.

NTHU Make Lab http://make.cs.nthu.edu.tw

Ukkonen’s Algorithm
 3 possible ways to extend S[j..i] with character
i+1.
1. S[j..i] ends at a leaf. Add the character i+1 to
the end of the leaf edge.
2. There is a path through S[j..i], but no match for
the i+1 character. Split the edge and create a
new node if necessary, then add a new leaf
with character i+1.
3. There is already a path through S[j..i+1]. Do
nothing.
Ukkonen’s Algorithm - mississippi
[Figure sequence: step-by-step construction of the suffix tree for ‘mississippi’; the images are not reproduced here.]
Ukkonen’s Algorithm
 In the form just presented, this is an O(m^3) time, O(m^2) space algorithm.
 We need a few implementation speed-ups to achieve the O(m) time and O(m) space bounds.
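The phase and extension structure, without any of the speed-ups, can be sketched on an uncompressed trie. This is only meant to show where the cubic time comes from; it is not Ukkonen's linear-time algorithm, it uses 0-based indices, and the nested-dictionary representation is an assumption of this example.

```python
def build_by_phases(s):
    """Phase i makes sure every substring s[j..i] (j <= i) is in the trie."""
    root = {}
    for i in range(len(s)):            # phase i: the trie now covers prefix s[0..i]
        for j in range(i + 1):         # extension j: ensure s[j..i] is present
            node = root
            for ch in s[j:i + 1]:      # walk from the root, adding missing nodes
                node = node.setdefault(ch, {})
    return root

def contains(root, pattern):
    node = root
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True

trie = build_by_phases("mississippi")
print(contains(trie, "ssi"))    # True
print(contains(trie, "sissy"))  # False
```

Three nested loops over the text give the O(m^3) time, and one node per character of every suffix gives the O(m^2) space.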
Suffix Array
The Suffix Array

Definition: Given a string D, the suffix array SA for this string is the sorted list of pointers to all suffixes of D. (Manber, Myers 1990)

The Suffix Array
Source: Prof. Chang-Biau Yang, Department of Computer Science and Engineering, National Sun Yat-sen University, http://par.cse.nsysu.edu.tw/~cbyang/

 In a suffix array, all suffixes of S are in non-decreasing lexical order.
 For example, S=“ATCACATCATCA”

i      0   1   2   3   4   5   6   7   8   9   10  11
SA[i]  11  3   8   0   5   10  2   7   4   9   1   6

Text order                        Sorted order
rank  suffix        name          i   suffix        name
3     ATCACATCATCA  S(0)          0   A             S(11)
10    TCACATCATCA   S(1)          1   ACATCATCA     S(3)
6     CACATCATCA    S(2)          2   ATCA          S(8)
1     ACATCATCA     S(3)          3   ATCACATCATCA  S(0)
8     CATCATCA      S(4)          4   ATCATCA       S(5)
4     ATCATCA       S(5)          5   CA            S(10)
11    TCATCA        S(6)          6   CACATCATCA    S(2)
7     CATCA         S(7)          7   CATCA         S(7)
2     ATCA          S(8)          8   CATCATCA      S(4)
9     TCA           S(9)          9   TCA           S(9)
5     CA            S(10)         10  TCACATCATCA   S(1)
0     A             S(11)         11  TCATCA        S(6)
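The definition can be checked directly on the example string. The sketch below simply sorts the suffix start positions by the suffixes they point to; it is the obvious quadratic-or-worse approach, shown only to make the definition concrete, and the helper name is illustrative.

```python
def naive_suffix_array(s):
    """Sort the start positions by the suffix that begins there."""
    return sorted(range(len(s)), key=lambda i: s[i:])

s = "ATCACATCATCA"
sa = naive_suffix_array(s)
print(sa)  # [11, 3, 8, 0, 5, 10, 2, 7, 4, 9, 1, 6]

# The inverse permutation gives the rank of each suffix in sorted order.
rank = [0] * len(s)
for r, start in enumerate(sa):
    rank[start] = r
print(rank)  # [3, 10, 6, 1, 8, 4, 11, 7, 2, 9, 5, 0]
```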
How do we build it ?
 Build a suffix tree
 Traverse the tree in DFS, lexicographically
picking edges outgoing from each node
and fill the suffix array.
 O(n) time
 Going through a suffix tree gives up some of the advantage (chiefly the smaller memory footprint) that the suffix array has over the suffix tree
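The traversal described above can be sketched as follows. To keep the example short it uses an uncompressed suffix trie built naïvely (so it is not O(n)); only the lexicographic depth-first traversal is the point. The terminator `$` and the names `Node` and `suffix_array_by_dfs` are assumptions of this sketch.

```python
class Node:
    def __init__(self):
        self.children = {}   # character -> Node
        self.start = None    # set on the node where a suffix ends

def suffix_array_by_dfs(s, end="$"):
    # Assumption: `end` does not occur in s and sorts before every character of s.
    assert end not in s and all(end < c for c in s)
    text = s + end
    root = Node()
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.children.setdefault(ch, Node())
        node.start = start                      # the terminator makes this a leaf
    sa, stack = [], [root]
    while stack:
        node = stack.pop()
        if node.start is not None and node.start < len(s):
            sa.append(node.start)               # skip the suffix that is just "$"
        # Push children in reverse so the smallest edge is explored first.
        for ch in sorted(node.children, reverse=True):
            stack.append(node.children[ch])
    return sa

print(suffix_array_by_dfs("mississippi"))  # [10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2]
```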
Direct suffix array construction algorithm

 Unfortunately, it is difficult to solve this problem with the suffix array Pos alone because Pos has lost the information on tree topology. In the direct algorithm, the array Height (saving lcp information) has the information on the tree topology which is lost in the suffix array Pos.

“Linear-Time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications”
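One well-known way to fill a Height array in linear time from the text and Pos is sketched below. It follows the approach commonly attributed to Kasai et al.; the convention Height[r] = lcp of the suffixes at sorted ranks r-1 and r (with Height[0] = 0) and the function name are assumptions of this example.

```python
def height_array(s, pos):
    """Height[r] = lcp(suffix pos[r-1], suffix pos[r]); Height[0] = 0."""
    n = len(s)
    rank = [0] * n
    for r, p in enumerate(pos):
        rank[p] = r
    height = [0] * n
    h = 0
    for i in range(n):                 # visit suffixes in text order
        if rank[i] == 0:
            h = 0                      # no predecessor in sorted order
            continue
        j = pos[rank[i] - 1]           # suffix just before suffix i in Pos
        while i + h < n and j + h < n and s[i + h] == s[j + h]:
            h += 1
        height[rank[i]] = h
        if h > 0:
            h -= 1                     # the next suffix shares at least h-1 characters
    return height

s = "mississippi"
pos = sorted(range(len(s)), key=lambda i: s[i:])
print(height_array(s, pos))  # [0, 1, 1, 4, 0, 0, 1, 0, 2, 1, 3]
```

The counter h drops by at most one per step and never exceeds n, which is why the total work stays linear.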
Skew-algorithm
 Step 1: SA≠0 = sort the suffixes starting at positions i ≠ 0 mod 3.
 Step 2: SA=0 = sort the suffixes starting at positions i = 0 mod 3.
 Step 3: SA = merge SA=0 and SA≠0.

    0 1 2 3 4 5 6 7 8 9 10
s = m i s s i s s i p p i
Step 1: SA≠0 = sort the suffixes starting at positions i ≠ 0 mod 3.

    0 1 2 3 4 5 6 7 8 9 10 11 12
s = m i s s i s s i p p i  $  $

Radix sort the triples s[i..i+2] for i ≠ 0 mod 3 and replace each triple by its rank:

position   1    4    7    10   2    5    8
triple     iss  iss  ipp  i$$  ssi  ssi  ppi
s12 =    [ 3    3    2    1    5    5    4 ]

=> SA≠0 = [ 10 7 4 1 8 5 2 ], obtained recursively in T(2n/3)

Suffixes of s12 and the corresponding suffixes of s:
s12_1    = 3 3 2 1 5 5 4      s_1    = i s s i s s i p p i
s12_4    = 3 2 1 5 5 4        s_4    = i s s i p p i
s12_7    = 2 1 5 5 4          s_7    = i p p i
s12_{10} = 1 5 5 4            s_{10} = i
s12_2    = 5 5 4              s_2    = s s i s s i p p i
s12_5    = 5 4                s_5    = s s i p p i
s12_8    = 4                  s_8    = p p i

SA≠0 = [ 10 7 4 1 8 5 2 ]
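Step 1 can be reproduced with a short sketch. Two simplifications are assumed here: Python's built-in sort stands in for the radix sort, and the suffixes of s12 are sorted directly instead of by the recursive call. The padding character `\0` plays the role of `$`, and a dummy triple is added at position n when n mod 3 = 1 so that the two groups of triples stay separated. All names are illustrative.

```python
def sample_names(s):
    """Return (positions, s12): the sample positions and the rank of each triple."""
    n = len(s)
    padded = s + "\0\0\0"                       # sentinels smaller than any character
    # Positions = 1 mod 3 (range includes the dummy position n when n % 3 == 1),
    # followed by positions = 2 mod 3.
    positions = list(range(1, n + 1, 3)) + list(range(2, n, 3))
    triples = [padded[p:p + 3] for p in positions]
    names = {t: r for r, t in enumerate(sorted(set(triples)), start=1)}
    return positions, [names[t] for t in triples]

def sort_sample_suffixes(s):
    """SA for the suffixes starting at positions i != 0 mod 3."""
    positions, s12 = sample_names(s)
    # The real algorithm recurses on s12; a direct sort is used here instead.
    order = sorted(range(len(s12)), key=lambda k: s12[k:])
    return [positions[k] for k in order if positions[k] < len(s)]

print(sample_names("mississippi"))          # ([1, 4, 7, 10, 2, 5, 8], [3, 3, 2, 1, 5, 5, 4])
print(sort_sample_suffixes("mississippi"))  # [10, 7, 4, 1, 8, 5, 2]
```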
It suffices to show that s12_i < s12_j <=> s_i < s_j.
Compare S_i and S_j where i = 0 mod 3 and j ≠ 0 mod 3:

case 1: j = 1 mod 3
∵ i+1 = 1 mod 3, j+1 = 2 mod 3 (so the ranks of S_{i+1} and S_{j+1} are both known from SA≠0)
∴ compare (s[i], S_{i+1}) with (s[j], S_{j+1}) in constant time.

case 2: j = 2 mod 3
∵ i+2 = 2 mod 3, j+2 = 1 mod 3
∴ compare (s[i], s[i+1], S_{i+2}) with (s[j], s[j+1], S_{j+2}) in constant time.
s12_i < s12_j <=> s_i < s_j
Case 1: i = j mod 3

positions 1 4 7 10 2 5 8:   s12 = [ 3 3 2 1 5 5 4 ]
positions 0..12:            s   = m i s s i s s i p p i $ $

Ex:
s12_4 = [ 3 2 1 5 5 4 ]   (positions 4 7 10 2 5 8)      s_4 = [ i s s i p p i $ $ ]         (positions 4..12)
s12_1 = [ 3 3 2 1 5 5 4 ] (positions 1 4 7 10 2 5 8)    s_1 = [ i s s i s s i p p i $ $ ]   (positions 1..12)

s12_4 < s12_1  <=>  s_4 < s_1

s12_i < s12_j <=> s_i < s_j
Case 2: i ≠ j mod 3

positions 1 4 7 10 2 5 8:   s12 = [ 3 3 2 1 5 5 4 ]
positions 0..12:            s   = m i s s i s s i p p i $ $

Ex:
s12_4 = [ 3 2 1 5 5 4 ]   (positions 4 7 10 2 5 8)      s_4 = [ i s s i p p i $ $ ]   (positions 4..12)
s12_5 = [ 5 4 ]           (positions 5 8)               s_5 = [ s s i p p i ]         (positions 5..10)

s12_4 < s12_5  <=>  s_4 < s_5

Step 2: SA=0 = sort the suffixes starting at positions i = 0 mod 3.

 The rank of s_j among {s_k | k ≠ 0 mod 3} was determined in Step 1 for all j ≠ 0 mod 3.
 SA=0 = radix sort { (s[i], S_{i+1}) | i = 0 mod 3 }.

    0 1 2 3 4 5 6 7 8 9 10
s = m i s s i s s i p p i

(s[i], S_{i+1})        ordered by S_{i+1} (Step 1)    after radix sort on s[i]
0: (m, ississippi)     9: (p, i)                      0: (m, ississippi)
3: (s, issippi)        6: (s, ippi)                   9: (p, i)
6: (s, ippi)           3: (s, issippi)                6: (s, ippi)
9: (p, i)              0: (m, ississippi)             3: (s, issippi)
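Step 2 can be sketched as two stable sorting passes over the pairs (s[i], rank of S_{i+1}), which is the least-significant-digit-first pattern of a radix sort. Python's stable `list.sort` stands in for the counting passes; the function name and the convention that an empty suffix has rank 0 are assumptions of this example.

```python
def sort_mod0_suffixes(s, sa12):
    """Sort the suffixes starting at i = 0 mod 3, given the sorted sample suffixes."""
    n = len(s)
    rank = [0] * (n + 1)                    # rank[n] = 0 stands for the empty suffix
    for r, p in enumerate(sa12, start=1):
        rank[p] = r
    mod0 = list(range(0, n, 3))
    mod0.sort(key=lambda i: rank[i + 1])    # pass 1: second component S_{i+1}
    mod0.sort(key=lambda i: s[i])           # pass 2: first component s[i] (stable)
    return mod0

print(sort_mod0_suffixes("mississippi", [10, 7, 4, 1, 8, 5, 2]))  # [0, 9, 6, 3]
```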
Step 3: SA = merge SA=0 and SA≠0.

 SA=0 = [s0 s9 s6 s3]
 SA≠0 = [s10 s7 s4 s1 s8 s5 s2]
 SA = merge SA=0 and SA≠0
= [s10 s7 s4 s1 s0 s9 s8 s6 s3 s5 s2]
= [10 7 4 1 0 9 8 6 3 5 2]
The merge takes O(n) time if we can determine the relative order of S_i ∈ SA=0 and S_j ∈ SA≠0 in constant time.
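The merge, including the two-case constant-time comparison between a suffix at a position = 0 mod 3 and a sample suffix, can be sketched like this. The helper names are illustrative, and positions past the end of the text are treated as an empty character with rank 0, which is an assumption standing in for the `$` padding.

```python
def merge_skew(s, sa0, sa12):
    """Merge the sorted mod-0 suffixes with the sorted sample suffixes."""
    n = len(s)
    rank = [0] * (n + 2)                       # ranks of sample suffixes, 0 past the end
    for r, p in enumerate(sa12, start=1):
        rank[p] = r

    def ch(i):
        return s[i] if i < n else ""           # "" sorts before every character

    def mod0_is_smaller(i, j):
        if j % 3 == 1:                         # case 1: one character, then ranks
            return (ch(i), rank[i + 1]) < (ch(j), rank[j + 1])
        # case 2: two characters, then ranks
        return (ch(i), ch(i + 1), rank[i + 2]) < (ch(j), ch(j + 1), rank[j + 2])

    out, a, b = [], 0, 0
    while a < len(sa0) and b < len(sa12):
        if mod0_is_smaller(sa0[a], sa12[b]):
            out.append(sa0[a]); a += 1
        else:
            out.append(sa12[b]); b += 1
    return out + sa0[a:] + sa12[b:]

print(merge_skew("mississippi", [0, 9, 6, 3], [10, 7, 4, 1, 8, 5, 2]))
# [10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2]
```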
Time complexity analysis

 Step1: O(n) + T(2n/3)

 Step2: O(n)
 Step3: O(n)
 T(n) = O(n) + T(2n/3) = O(n)
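The last line can be unrolled into a geometric series. Writing the O(n) work of one level as at most cn for some constant c (c is an assumption introduced only for this derivation):

```latex
T(n) \;\le\; cn + T\!\left(\tfrac{2n}{3}\right)
     \;\le\; cn \sum_{k \ge 0} \left(\tfrac{2}{3}\right)^{k}
     \;=\; \frac{cn}{1 - \tfrac{2}{3}}
     \;=\; 3cn \;=\; O(n)
```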
Exact matching using a Suffix Array
ABAABBABBAC

SUFFIX ARRAY SA:

SA = 2 0 3 6 9 1 5 8 4 7 10
Basic Idea: 2 binary searches in SA
Search for leftmost position
Search for rightmost position
Search for leftmost occurrence of: BB

ABAABBABBAC

SA    = 2 0 3 6 9 1 5 8 4 7 10
index = 0 1 2 3 4 5 6 7 8 9 10

 BB > BA: continue binary search in the right (larger) half of SA
 BB = BB: more occurrences of BB left of this one possible!
 BB > BA: leftmost position of BB is pointed to by SA[8]

Search for rightmost occurrence of: BB
 BB = BB: more occurrences of BB right of this one possible!
 BB < C: rightmost position of BB is pointed to by SA[9]

Results of search for: BB
 leftmost position of BB is pointed to by SA[8]
 rightmost position of BB is pointed to by SA[9]
=> All occurrences of the pattern BB are pointed to by SA[8..9]
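The two binary searches can be sketched as follows. Each probe compares the pattern with the first m characters of the suffix at the midpoint; the function name and the convention of returning an empty range as left > right are choices made for this example.

```python
def find_range(text, sa, pattern):
    """Return (left, right) such that SA[left..right] are all occurrences."""
    m = len(pattern)
    # Leftmost: first index whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    left = lo
    # Rightmost: first index whose length-m prefix is > pattern, minus one.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return left, lo - 1

text = "ABAABBABBAC"
sa = sorted(range(len(text)), key=lambda i: text[i:])
print(sa)                                    # [2, 0, 3, 6, 9, 1, 5, 8, 4, 7, 10]
left, right = find_range(text, sa, "BB")
print(left, right, sa[left:right + 1])       # 8 9 [4, 7]
```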
Important Properties

for |SA| = n and m = length of pattern:

• Size: 1 pointer per letter (4 bytes each if the text is smaller than 4 GB)
• Speed of exact matching:
• O(log n) binary search steps
• # of compared chars is O(m log n), which can be reduced to O(m + log n)
Longest common prefixes
 Definition: lcp(i,j) is the length of the longest common prefix of the suffixes beginning at SA[i] and SA[j].
 Mississippi Example
s = m i s s i s s i p p i
SA = [10 7 4 1 0 9 8 6 3 5 2]
 SA[2] = 4 (issippi)
 SA[3] = 1 (ississippi)
 lcp(2, 3) = 4
Example
Source: Haim Kaplan's home page, http://www.math.tau.ac.il/~haimk/

Let S = mississippi
Let P = issa

L →  10  i
      7  ippi
      4  issippi
      1  ississippi
      0  mississippi
M →   9  pi
      8  ppi
      6  sippi
      3  sissippi
      5  ssippi
R →   2  ssissippi
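How do we accelerate the search ?

Maintain l = lcp(P, L)
Maintain r = lcp(P, R)

If l = r then start comparing M to P at l + 1

[Figure: the SA interval between L and R with midpoint M; l and r mark how far P matches the suffixes at L and R.]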
How do we accelerate the search ?

If l > r then

Suppose we know lcp(L,M)

If lcp(L,M) < l we go left
If lcp(L,M) > l we go right
If lcp(L,M) = l we start comparing at l + 1
Analysis of the acceleration

If we do more than a single comparison in an iteration then max(l, r) grows by 1 for each comparison ⇒ O(log n + m) time
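A simpler relative of this acceleration keeps l and r but, lacking precomputed lcp(L,M) values, restarts every comparison at min(l, r). That variant is sketched below; it is not the full O(m + log n) scheme from the slides and can still take O(m log n) in the worst case. The names `extend` and `lower_bound` are made up for this example.

```python
def extend(p, text, start, k):
    """Extend a known common prefix of length k between p and text[start:].
    Returns (new_k, p_le) where p_le is True iff p <= text[start:],
    counting p being a prefix of the suffix as <=."""
    while k < len(p) and start + k < len(text) and p[k] == text[start + k]:
        k += 1
    if k == len(p):
        return k, True
    if start + k == len(text):
        return k, False
    return k, p[k] < text[start + k]

def lower_bound(text, sa, p):
    """Index of the first suffix in sa that is >= p (len(sa) if none)."""
    l, le = extend(p, text, sa[0], 0)
    if le:
        return 0
    r, le = extend(p, text, sa[-1], 0)
    if not le:
        return len(sa)
    lo, hi = 0, len(sa) - 1            # invariant: suffix lo < p <= suffix hi
    while hi - lo > 1:
        mid = (lo + hi) // 2
        # Every suffix between lo and hi shares min(l, r) characters with p.
        k, le = extend(p, text, sa[mid], min(l, r))
        if le:
            hi, r = mid, k
        else:
            lo, l = mid, k
    return hi

text = "mississippi"
sa = sorted(range(len(text)), key=lambda i: text[i:])
print(lower_bound(text, sa, "issa"))  # 2 (the suffix issippi is the first one >= issa)
```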
Complicated Sorting Algorithm
 Using radix sort on one character at a time, O(N^2) in total
 Sorting by the first H characters, then by 2H, 4H, 8H etc. (each stage a radix sort) → O(N log N)
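The doubling idea can be sketched as follows. Each round sorts the suffixes by their first 2H characters using the ranks from the previous round as keys. For brevity the sketch uses Python's comparison sort in each round, so it runs in O(N log^2 N) rather than the O(N log N) obtained with radix sort; the function name is illustrative.

```python
def suffix_array_doubling(s):
    n = len(s)
    sa = list(range(n))
    rank = [ord(c) for c in s]          # order by the first character
    h = 1
    while True:
        # Key for the first 2h characters: rank of the first h, rank of the next h.
        def key(i):
            return (rank[i], rank[i + h] if i + h < n else -1)
        sa.sort(key=key)
        new_rank = [0] * n
        for a, b in zip(sa, sa[1:]):
            new_rank[b] = new_rank[a] + (key(a) != key(b))
        rank = new_rank
        if n == 0 or rank[sa[-1]] == n - 1:
            break                       # all ranks distinct: fully sorted
        h *= 2
    return sa

print(suffix_array_doubling("mississippi"))  # [10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2]
```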
Precomputed LCP Array Construction

 Compute lcps between suffixes that are consecutive in the sorted Pos array:
 Range Minimum Query Theorem:
 lcp(A_{Pos[i]}, A_{Pos[j]}) = min { lcp(A_{Pos[k]}, A_{Pos[k+1]}) : k ∈ [i, j-1] }
 lcp(A_p, A_q) = H + lcp(A_{p+H}, A_{q+H})
 Given H-bucket lcps, compute 2H-bucket lcps
 still require too much time
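The range minimum property can be checked on a small example. The sketch computes the lcps of adjacent suffixes by direct comparison and then answers an arbitrary pair with a plain `min` over the range; both shortcuts are for illustration only, and the helper names are assumptions.

```python
def lcp_len(a, b):
    k = 0
    while k < len(a) and k < len(b) and a[k] == b[k]:
        k += 1
    return k

def adjacent_lcps(text, pos):
    """adj[k] = lcp of the suffixes at sorted ranks k and k+1."""
    return [lcp_len(text[pos[k]:], text[pos[k + 1]:]) for k in range(len(pos) - 1)]

def lcp_between(adj, i, j):
    """lcp of the suffixes at sorted ranks i < j, as the minimum over adj[i..j-1]."""
    return min(adj[i:j])

text = "mississippi"
pos = sorted(range(len(text)), key=lambda i: text[i:])
adj = adjacent_lcps(text, pos)
print(adj)                      # [1, 1, 4, 0, 0, 1, 0, 2, 1, 3]
print(lcp_between(adj, 1, 3))   # 1  (ippi vs ississippi share only "i")
```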
Precomputed LCP Array Construction

 Using height(i) = lcp(A_{Pos[i-1]}, A_{Pos[i]})
 Using Hgt[i] to record height(i) when it is correct
 For b-th iteration
 if height(i) ≤ (b-1)H and height(i) < bH, then Hgt[i] = height(i)
 Otherwise, Hgt[i] = N+1 (undefined)
Precomputed LCP Array Construction

 Constructing interval tree
 O(N)-space height balanced tree structure that records the minimum pairwise lcp over a collection of intervals of the suffix array
 Compute min( Hgt[k] : k ∈ [i, j] )
 Takes O(log N) time
 overall O(N log N) time
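A segment tree is one familiar structure with these properties: O(N) space, balanced, and O(log N) per range minimum query. The sketch below uses it as a stand-in; the interval tree in the slides may differ in detail, and the class name is illustrative.

```python
class MinTree:
    """Bottom-up segment tree for range minimum queries."""
    def __init__(self, values):
        self.n = len(values)
        self.tree = [float("inf")] * self.n + list(values)   # leaves at n..2n-1
        for i in range(self.n - 1, 0, -1):
            self.tree[i] = min(self.tree[2 * i], self.tree[2 * i + 1])

    def query(self, i, j):
        """Minimum of values[i..j], both ends inclusive."""
        lo, hi = i + self.n, j + self.n + 1
        best = float("inf")
        while lo < hi:
            if lo & 1:
                best = min(best, self.tree[lo]); lo += 1
            if hi & 1:
                hi -= 1; best = min(best, self.tree[hi])
            lo >>= 1; hi >>= 1
        return best

hgt = [0, 1, 1, 4, 0, 0, 1, 0, 2, 1, 3]   # heights for mississippi
tree = MinTree(hgt)
print(tree.query(1, 3))   # 1
print(tree.query(8, 10))  # 1
```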
Linear Time Expected-case Variations

 Require additional O(N) structure
 Longest Repeated Substring
 2 log_{|Σ|} N + O(1)
 Sorting algorithm => O(N log log N)
 Linear Time Algorithm
 Perform RadixSort on T-symbols of each suffix
 Improve both sorting algorithm and lcp computation
Constant Time lcp Construction
 LCP[i] = lcp(SA[i], SA[i+1])
 lcp(i, j) = min_{i ≤ k < j} LCP[k]
 j = SA[i], k = SA[i+1]
 Case 1: j mod 3 = 1, k mod 3 = 2 => adjacent
 j’ = (j-1)/3, k’ = (n+k-2)/3 => adjacent
 l = lcp12(j’, k’) = LCP12[SA12[j’]-1]
 LCP[i] = lcp(j, k) = 3l + lcp(j+3l, k+3l), where lcp(j+3l, k+3l) ≤ 2
 Constant time
Constant Time lcp Construction
 Case 2:
 j mod 3 = 0, k mod 3 = 1 (or k mod 3 = 2)
 If s[j] ≠ s[k], LCP[i] = 0
 Otherwise, LCP[i] = 1 + lcp(j+1, k+1) (reduces to Case 1)
 lcp(j+1, k+1) = 3l + lcp(j+1+3l, k+1+3l), if the suffixes j+1 and k+1 are adjacent in SA
 If not adjacent, perform range minimum query
 No suffix is involved in more than two lcp queries at the top level of the extended skew algorithm
 Constant time
Linear Time lcp Construction
 LCP[i] = lcp(SA[i], SA[i+1])
 lcp(i, j) = min_{i ≤ k < j} LCP[k]
 j = SA[i], k = SA[i+1]
 Case 1: j mod 3 = 1, k mod 3 = 2
 j’ = (j-1)/3, k’ = (n+k-2)/3 => adjacent in SA12
 l = lcp12(j’, k’) = LCP12[SA12[j’]]
 LCP[i] = lcp(j, k) = 3l + lcp(j+3l, k+3l), where lcp(j+3l, k+3l) ≤ 2
 Constant time
Linear Time lcp Construction
    0 1 2 3 4 5 6 7 8 9 10
s = m i s s i s s i p p i

s12   = [ 3 3 2 1 5 5 4 ]
SA12  = [ 3 2 1 0 6 5 4 ]
LCP12 = [ 0 0 1 0 0 1 0 ]

 LCP12 is used to decide triple-lcps (groups of lcps of 3 characters)
Linear Time lcp Construction
 Answering range minimum queries on LCP12 in constant time needs O(n) preprocessing time
 Lemma: No suffix is involved in more than two lcp queries at the top level of the extended skew algorithm
 A suffix can be involved in lcp queries only with its two lexicographically nearest neighbors that have the same preceding character
Linear Time lcp Construction
 LCP12 construction algorithm
 LCP12 array is divided into blocks of size log(n)
 For each block [a, b], precompute and store the following data:
 For all i ∈ [a, b], Q_i identifies all j ∈ [a, i] such that LCP12[j] < min_{k ∈ [j+1, i]} LCP12[k]
 For all i ∈ [a, b], the minimum values over the ranges [a, i] and [i, b]
 The minimum for all ranges that end just before or begin just after [a, b] and contain exactly a power of two full blocks
 [i, j] is completely inside a block
 Its minimum can be found with the help of Q_j in constant time
 [i, j] is covered with some ranges whose minimum is stored
 Its minimum is the smallest of those minima
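The idea of storing minima for power-of-two ranges and covering a query with stored ranges can be shown with a plain sparse table. This is a simpler stand-in for the block scheme above: queries are constant time, but preprocessing takes O(n log n) time and space instead of O(n). The class name is illustrative.

```python
class SparseTable:
    """table[k][i] = min of values[i .. i + 2^k - 1]."""
    def __init__(self, values):
        self.table = [list(values)]
        n, span = len(values), 1
        while 2 * span <= n:
            prev = self.table[-1]
            self.table.append([min(prev[i], prev[i + span])
                               for i in range(n - 2 * span + 1)])
            span *= 2

    def query(self, i, j):
        """Minimum of values[i..j]: two overlapping power-of-two ranges cover it."""
        level = (j - i + 1).bit_length() - 1
        row = self.table[level]
        return min(row[i], row[j - (1 << level) + 1])

lcp12 = [0, 0, 1, 0, 0, 1, 0]
st = SparseTable(lcp12)
print(st.query(2, 2))  # 1
print(st.query(1, 5))  # 0
```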
Linear Time lcp Construction
 LCP[i] = lcp(j, k) = 3l + lcp(j+3l, k+3l), where lcp(j+3l, k+3l) ≤ 2
 l represents the number of triple-lcps
 3l represents the number of characters covered by those matching triples
 The rest is non-triple lcps, which have length at
most 2
 Applying character comparison, they can be
done in constant time (at most 2 comparisons)
 Computing LCP[i] is O(1) for case 1
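The decomposition into whole matching triples plus at most two extra characters can be checked with a small sketch. Here l is obtained by comparing the two suffixes of s12 directly, which stands in for the LCP12 lookup and range minimum query; the padding with `\0` and the dummy triple at position n follow the usual skew-algorithm convention and, like the function names, are assumptions of this example.

```python
def common_prefix_len(a, b, limit=None):
    k = 0
    while (k < len(a) and k < len(b) and a[k] == b[k]
           and (limit is None or k < limit)):
        k += 1
    return k

def build_s12(s):
    n = len(s)
    padded = s + "\0\0\0"
    positions = list(range(1, n + 1, 3)) + list(range(2, n, 3))
    triples = [padded[p:p + 3] for p in positions]
    names = {t: r for r, t in enumerate(sorted(set(triples)), start=1)}
    return positions, [names[t] for t in triples]

def sample_lcp(s, j, k):
    """lcp of the suffixes at sample positions j and k (both != 0 mod 3)."""
    positions, s12 = build_s12(s)
    index = {p: a for a, p in enumerate(positions)}
    # l = number of whole triples that match (stand-in for LCP12 + RMQ).
    l = common_prefix_len(s12[index[j]:], s12[index[k]:])
    # At most 2 further character comparisons after the matching triples.
    return 3 * l + common_prefix_len(s[j + 3 * l:], s[k + 3 * l:], limit=2)

print(sample_lcp("mississippi", 4, 1))  # 4  (issippi vs ississippi: one triple + 1)
print(sample_lcp("mississippi", 5, 2))  # 3  (ssippi vs ssissippi: one triple + 0)
```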
Linear Time lcp Construction
 Case 2:
 j mod 3 = 0, k mod 3 = 1
 If s[j] ≠ s[k], LCP[i] = 0
 Otherwise, LCP[i] = 1 + lcp(j+1, k+1) (reduces to Case 1)
 lcp(j+1, k+1) = 3l + lcp(j+1+3l, k+1+3l), if the suffixes j+1 and k+1 are adjacent in SA
 If not adjacent, perform range minimum query
 No suffix is involved in more than two lcp queries at the top level of the extended skew algorithm
 Constant time
Applications of Suffix Trees and Suffix Arrays

 Exact String Match

 The Exact Set Matching Problem
 The problem of finding all occurrences from a set of
strings P in a text T, where the set is input all at once.
 The Substring Problem for a Database of
Patterns
 A set of strings, or a database, is first known and fixed.
Later sequence of strings will be presented and for each
presented string S, the algorithm must find all the strings
in the database containing S as a substring.
Applications of Suffix Trees and Suffix Arrays

 Longest Common Substring of Two Strings

 Recognizing DNA Contamination
 Common Substrings of More Than Two Strings
 Building a Smaller Directed Graph for Exact
Matching
 how to compress a suffix tree into a directed acyclic
graph(DAG) that can be used to solve the exact
matching problem (and others) in linear time but that
uses less space than the tree.
Applications of Suffix Trees and Suffix Arrays

 A Reverse Role for Suffix Trees, and Major Space Reduction
 Define ms(i) to be the length of the longest substring of T starting at position i that matches a substring somewhere (but we don’t know where) in P. These values are called the matching statistics.
 Space-Efficient Longest Common Substring Algorithm
 All-Pairs Suffix-Prefix Matching
 Given two strings Si and Sj, any suffix of Si that matches a prefix of Sj is called a suffix-prefix match of Si, Sj.
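The definition of the matching statistics ms(i) can be made concrete with a brute-force sketch. It uses Python's substring test instead of a suffix tree, so it illustrates only the definition, not the efficient algorithm; positions are 0-based and the function name is illustrative.

```python
def matching_statistics(t, p):
    """ms[i] = length of the longest substring of t starting at i that occurs in p."""
    ms = []
    for i in range(len(t)):
        k = 0
        # Grow the match while the next longer substring still occurs in p.
        while i + k < len(t) and t[i:i + k + 1] in p:
            k += 1
        ms.append(k)
    return ms

print(matching_statistics("abab", "ba"))  # [1, 2, 1, 1]
```

The largest value in the list is the length of the longest common substring of the two strings, which connects this slide to the space-efficient longest common substring item.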
Suffix Trees and Suffix Arrays
 Suffix
 Each position in the text is considered as a text suffix.
 A string that goes from that text position to the end of the text
 Advantage
 They answer efficiently more complex queries.
 Drawback
 Costly construction process
 The text must be readily available at query time
 The results are not delivered in text position order.
NLP Laboratory of Hanshin University http://infocom.chonan.ac.kr/~limhs/
Compression
 Suffix trees can be compressed almost to size of
suffix arrays
 Suffix arrays can’t be compressed (almost
random), but can be constructed over
compressed text
 instead of Huffman, use a code that respects alphabetic
order
 almost the same compression
 Signature files are sparse, so can be compressed
 ratios up to 70%
Compression
 Suffix trees and suffix arrays
 Suffix arrays are very hard to compress further.
 Because they represent an almost perfectly random
permutation of the pointers to the text.
 Suffix arrays on compressed text
 The main advantage is that both index construction and
querying almost double their performance.
• Construction is faster because more compressed text fits in the
same memory space and therefore fewer text blocks are needed.
• Searching is faster because a large part of the search time is
spent in disk seek operations over the text area to compare
suffixes.
Where have suffix trees been used?
 Problems
 linear-time longest common substring
 constant-time least common ancestor

 maximally repetitive structures

 all-pairs suffix-prefix matching

 compression

 inexact matching

 conversion to suffix arrays


Where have suffix trees / arrays been used?

 Applications
 The Human Genome Project (see Skiena)
 motif discovery (see Arabidopsis genome
project)
 PST – probabilistic suffix trees

 SVM string kernels

 chromosome-level similarities and

rearrangements
When have suffix trees / arrays been used?

 When they solve your problem.

 When you need results fast!
 When you have memory to spare.
 …more caveats.
fin
