
Suffix Trees and Suffix Arrays

Suffix trees and suffix arrays allow for efficient string matching by representing all suffixes of a string in a compressed trie structure. They enable finding all occurrences of a pattern in linear time after preprocessing the text in linear time to build the suffix tree. Generalized suffix trees can represent all suffixes of a collection of strings and allow applications like longest common substring queries. Maximal palindromes in a string can be found using the lowest common ancestor queries on a generalized suffix tree of the string and its reversal.

Uploaded by

nik4u
Copyright
© Attribution Non-Commercial (BY-NC)

Suffix trees and suffix arrays

Trie
• A tree representing a set of strings.

Example set: { aeef, ad, bbfe, bbfg, c }

[Figure: trie for the set above.]
Trie (Cont)
• Assume no string is a prefix of another

Each edge is labeled by a letter; no two edges outgoing from the same node are labeled the same.

Each string corresponds to a leaf.

[Figure: trie for { aeef, ad, bbfe, bbfg, c }.]
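The trie rules above can be sketched in a few lines of Python. This is only an illustration, not the slides' own code: the nested-dict representation and the names `build_trie` and `leaves` are assumptions made for the example.

```python
def build_trie(words):
    """Build a trie as nested dicts: each key is an edge label (one letter)."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            # setdefault reuses an existing edge, so no node ever has
            # two outgoing edges with the same letter.
            node = node.setdefault(ch, {})
    return root


def leaves(node, prefix=""):
    """Spell out the string at every leaf, visiting edges in sorted order."""
    if not node:
        return [prefix]
    found = []
    for ch in sorted(node):
        found.extend(leaves(node[ch], prefix + ch))
    return found


trie = build_trie(["aeef", "ad", "bbfe", "bbfg", "c"])
print(sorted(trie))   # letters on the edges leaving the root
print(leaves(trie))   # one string per leaf
```

Because no string in the set is a prefix of another, every string ends at a leaf and `leaves` returns exactly the input set.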
Compressed Trie
• Compress unary nodes, label edges by strings

[Figure: the trie above next to its compressed form, in which chains of unary nodes become single edges such as bbf and eef.]
Suffix tree
Given a string s, a suffix tree of s is a compressed trie of all suffixes of s.

To make these suffixes prefix-free, we add a special character, say $, at the end of s.
Suffix tree (Example)
Let s = abab. A suffix tree of s is a compressed trie of all suffixes of abab$:

{ $, b$, ab$, bab$, abab$ }

[Figure: suffix tree of abab$.]
Trivial algorithm to build a Suffix tree

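Insert the suffixes one at a time, from longest to shortest:

• Put the longest suffix, abab$, in: the tree is a single edge labeled abab$.
• Put the suffix bab$ in: it shares no prefix with abab$, so it becomes a second edge out of the root.
• Put the suffix ab$ in: it shares the prefix ab with abab$, so that edge is split after ab.
• Put the suffix b$ in: it shares the prefix b with bab$, so that edge is split after b.
• Put the suffix $ in: it becomes a new edge out of the root.

[Figure: the tree after each insertion.]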
We will also label each leaf with the starting point of the corresponding suffix.

[Figure: suffix tree of abab$ with leaves labeled 1 (abab$), 2 (bab$), 3 (ab$), 4 (b$) and 5 ($).]
Analysis
Takes O(n^2) time to build.

We will see how to do it in O(n) time
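A minimal sketch of the trivial quadratic construction, assuming the text does not already contain the end marker. The `Node` class and the function names are hypothetical choices for this example; this is not the O(n) algorithm.

```python
class Node:
    def __init__(self):
        self.children = {}   # first letter of edge label -> (label, child)
        self.leaf = None     # 1-based start of the suffix (leaves only)


def build_suffix_tree(s, end="$"):
    """Naive O(n^2) construction: insert each suffix of s + end in turn."""
    text = s + end
    root = Node()
    for start in range(len(text)):
        node, rest = root, text[start:]
        while True:
            first = rest[0]
            if first not in node.children:
                leaf = Node()
                leaf.leaf = start + 1
                node.children[first] = (rest, leaf)   # new leaf edge
                break
            label, child = node.children[first]
            k = 0   # number of letters shared by the edge label and rest
            while k < len(label) and k < len(rest) and label[k] == rest[k]:
                k += 1
            if k < len(label):
                # Mismatch inside the edge: split it after k letters.
                mid = Node()
                mid.children[label[k]] = (label[k:], child)
                node.children[first] = (label[:k], mid)
                child = mid
            node, rest = child, rest[k:]
    return root


def leaf_labels(node):
    """Leaf labels in DFS order, taking edges in sorted order."""
    if node.leaf is not None:
        return [node.leaf]
    out = []
    for first in sorted(node.children):
        out.extend(leaf_labels(node.children[first][1]))
    return out


tree = build_suffix_tree("abab")
print(sorted(label for label, _ in tree.children.values()))
print(leaf_labels(tree))
```

The unique end marker guarantees that a suffix is never used up in the middle of an existing path, so `rest` is never empty when the loop repeats.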


What can we do with it?
Exact string matching:
Given a text T, |T| = n, preprocess it such that when a pattern P, |P| = m, arrives you can quickly decide whether it occurs in T.

We may also want to find all occurrences of P in T.
Exact string matching
In preprocessing we just build a suffix tree in O(n) time
[Figure: suffix tree of abab$ with leaves labeled 1 to 5 by suffix start.]

Given a pattern P = ab, we traverse the tree according to the pattern.

If we did not get stuck traversing the pattern, then the pattern occurs in the text.
Each leaf in the subtree below the node we reach corresponds to an occurrence. For P = ab the leaves below are 1 and 3.

By traversing this subtree we get all k occurrences in O(m+k) time.
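The traversal described above can be sketched as follows. To keep the example short it uses an uncompressed suffix trie (nested dicts) in place of a real suffix tree, so it takes quadratic space and does not meet the time bounds stated above. The function names are assumptions, and the end marker is assumed not to occur in the text or the pattern.

```python
def build_suffix_trie(text, end="$"):
    """Uncompressed suffix trie; the end marker key stores the 1-based start."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node[end] = start + 1   # leaf: where this suffix begins
    return root


def find_occurrences(root, pattern, end="$"):
    """Walk down along pattern, then collect every leaf below that node."""
    node = root
    for ch in pattern:
        if ch not in node:
            return []           # stuck: the pattern does not occur
        node = node[ch]
    found, stack = [], [node]
    while stack:
        cur = stack.pop()
        for ch, child in cur.items():
            if ch == end:
                found.append(child)   # a leaf label, i.e. one occurrence
            else:
                stack.append(child)
    return sorted(found)


trie = build_suffix_trie("abab")
print(find_occurrences(trie, "ab"))
print(find_occurrences(trie, "aa"))
```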
Generalized suffix tree
Given a set of strings S, a generalized suffix tree of S is a compressed trie of all suffixes of every s ∈ S.
To make these suffixes prefix-free, we add a special char, say $, at the end of s.

To associate each suffix with a unique string in S, add a different special char to each s.
Generalized suffix tree (Example)
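Let s1 = abab and s2 = aab. Here is a generalized suffix tree for s1 and s2, built from abab$ and aab#.

Suffixes of abab$: { $, b$, ab$, bab$, abab$ }
Suffixes of aab#: { #, b#, ab#, aab# }

[Figure: generalized suffix tree of abab$ and aab#; each leaf is labeled with the start of its suffix within its own string.]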
So what can we do with it?
Matching a pattern against a database of strings
Longest common substring (of two strings)
Every node with a leaf descendant from string s1 and a leaf descendant from string s2 represents a maximal common substring, and vice versa.

Find such a node with the largest “string depth”.

[Figure: generalized suffix tree of abab$ and aab#.]
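A small sketch of the idea, assuming an uncompressed generalized suffix trie stands in for the generalized suffix tree (so it is quadratic, not linear). Each node records which strings have a suffix passing through it; the deepest node seen by both strings spells a longest common substring. The dict layout and the function name are illustrative assumptions.

```python
def longest_common_substring(s1, s2):
    """Deepest trie node that has leaves from both strings below it."""
    root = {"kids": {}, "owners": set()}
    for owner, text in ((1, s1), (2, s2)):
        for start in range(len(text)):
            node = root
            for ch in text[start:]:
                node = node["kids"].setdefault(ch, {"kids": {}, "owners": set()})
                # A suffix of this string passes through the node, so
                # some leaf below it belongs to string number `owner`.
                node["owners"].add(owner)
    best = ""
    stack = [(root, "")]
    while stack:
        node, path = stack.pop()
        if len(path) > len(best):
            best = path              # len(path) is the string depth
        for ch, child in node["kids"].items():
            if child["owners"] == {1, 2}:
                stack.append((child, path + ch))
    return best


print(longest_common_substring("abab", "aab"))
```

If several common substrings share the maximum length, this sketch returns whichever one the traversal reaches first.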
Lowest common ancestors
A lot more can be gained from the suffix tree if we preprocess it so that we can answer LCA queries on it.
Why?
The LCA of two leaves represents the longest common prefix (LCP) of these 2 suffixes.

[Figure: generalized suffix tree of abab$ and aab#.]
Finding maximal palindromes
• A palindrome: caabaac, cbaabc
• Want to find all maximal palindromes in a string s

Let s = cbaaba

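The maximal odd-length palindrome centered at position i is given by the LCP of the suffix at position i of s and the suffix at position m-i+1 of s^r, where s^r is the reversal of s and m = |s|: an LCP of length k means the palindrome s[i-k+1..i+k-1].
For the even-length palindrome with center between i-1 and i, pair suffix i of s with suffix m-i+2 of s^r instead.
For example, with i = 5 the suffixes ba and baabc have LCP ba, giving the palindrome aba.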
Maximal palindromes algorithm
Prepare a generalized suffix tree for s = cbaaba$ and s^r = abaabc#.

For every i, find the LCA of suffix i of s and suffix m-i+1 of s^r.
Let s = cbaaba$; then s^r = abaabc#.

[Figure: generalized suffix tree of cbaaba$ and abaabc#, with leaves labeled 1 to 7 for each string.]
Analysis
O(n) time to identify all palindromes
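The pairing of suffixes of s with suffixes of its reversal can be sketched as below. As a stated simplification, the LCP is computed by scanning letters directly instead of by a constant-time LCA query, so this version is quadratic in the worst case. Indices are 0-based here, and the function names are assumptions.

```python
def lcp_len(a, i, b, j):
    """Length of the longest common prefix of a[i:] and b[j:] (naive scan)."""
    k = 0
    while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
        k += 1
    return k


def maximal_palindromes(s):
    """Maximal palindrome for every centre of s, from left to right."""
    m, rev = len(s), s[::-1]
    result = []
    for i in range(m):
        # Odd length, centred on s[i]: rev[m-1-i:] is s[:i+1] reversed,
        # so both suffixes start with the centre letter.
        k = lcp_len(s, i, rev, m - 1 - i)
        result.append(s[i - k + 1 : i + k])
        # Even length, centred between s[i] and s[i+1].
        if i + 1 < m:
            k = lcp_len(s, i + 1, rev, m - 1 - i)
            result.append(s[i - k + 1 : i + 1 + k])
    return result


pals = maximal_palindromes("cbaaba")
print(max(pals, key=len))   # baab
```

An empty string in the result means no even-length palindrome has that centre.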
Drawbacks
• Suffix trees consume a lot of space

• The space is O(n), but the constant is quite big

• Notice that if we indeed want to traverse an edge in O(1) time then we need an array of ptrs. of size |Σ| in each node
Suffix array
• We lose some of the functionality but we save space.

Let s = abab
Sort the suffixes lexicographically:
ab, abab, b, bab
The suffix array gives the indices of the
suffixes in sorted order
3 1 4 2
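The definition can be checked directly by sorting the suffixes. This sketch uses Python's built-in sort on suffix slices, which is simple but far from linear time. The name `suffix_array` and the 1-based indices (chosen to match the example above) are assumptions of the sketch.

```python
def suffix_array(s):
    """1-based start positions of the suffixes of s in lexicographic order."""
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])


print(suffix_array("abab"))   # [3, 1, 4, 2]
```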
How do we build it ?
• Build a suffix tree
• Traverse the tree in DFS, lexicographically
picking edges outgoing from each node
and fill the suffix array.

• O(n) time
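The DFS idea can be sketched as follows, with an uncompressed suffix trie standing in for the suffix tree, so the sketch itself is quadratic rather than O(n). It assumes the end marker `$` sorts before every character of the string, which holds for lowercase letters. The function name is hypothetical.

```python
def suffix_array_by_dfs(s, end="$"):
    """Read the suffix array off a suffix trie by a lexicographic DFS."""
    text = s + end
    root = {}
    for start in range(len(s)):      # skip the suffix that is only the marker
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        # The key "leaf" has several letters, so it cannot clash with an edge.
        node["leaf"] = start + 1
    order = []

    def dfs(node):
        if "leaf" in node:
            order.append(node["leaf"])
            return
        for ch in sorted(node):      # pick outgoing edges in sorted order
            dfs(node[ch])

    dfs(root)
    return order


print(suffix_array_by_dfs("abab"))   # [3, 1, 4, 2]
```

The recursion depth grows with the length of the string, so this form is only suitable for short inputs.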
How do we search for a pattern ?
• If P occurs in T then all its occurrences are
consecutive in the suffix array.

• Do a binary search on the suffix array

• Takes O(m log n) time


Example
Let S = mississippi
Let P = issa

L, M and R mark the left end, the middle and the right end of the current search range.

     i    suffix
L    11   i
     8    ippi
     5    issippi
     2    ississippi
     1    mississippi
M    10   pi
     9    ppi
     7    sippi
     4    sissippi
     6    ssippi
R    3    ssissippi
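A sketch of the plain binary search, assuming a suffix array built by sorting (not the linear-time route) and 1-based positions. Each probe compares at most m letters, which gives the O(m log n) bound. The helper names are assumptions of this example.

```python
def suffix_array(s):
    """1-based suffix starts in lexicographic order (built by plain sorting)."""
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])


def find_occurrences(text, sa, pattern):
    """All 1-based positions of pattern, found by two binary searches on sa."""
    m = len(pattern)

    def first_index(strictly_greater):
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            # Compare only the first m letters of the middle suffix.
            prefix = text[sa[mid] - 1 : sa[mid] - 1 + m]
            if prefix < pattern or (strictly_greater and prefix == pattern):
                lo = mid + 1
            else:
                hi = mid
        return lo

    left = first_index(False)    # first suffix whose m-prefix is >= pattern
    right = first_index(True)    # first suffix whose m-prefix is > pattern
    return sorted(sa[left:right])   # occurrences are consecutive in sa


text = "mississippi"
sa = suffix_array(text)
print(find_occurrences(text, sa, "issa"))
print(find_occurrences(text, sa, "issi"))
```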
How do we accelerate the search ?

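Maintain l = LCP(P,L), the length of the longest common prefix of P and the suffix at L
Maintain r = LCP(P,R), the same length for the suffix at R

If l = r then start comparing M to P at l + 1

[Figure: L, M and R in the suffix array, with the matched lengths l and r marked.]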
How do we accelerate the search ?


If l > r then

Suppose we know LCP(L,M)

If LCP(L,M) < l we go left
If LCP(L,M) > l we go right
If LCP(L,M) = l we start comparing at l + 1

[Figure: L, M and R in the suffix array, with the matched lengths l and r marked.]
Analysis of the acceleration
If we do more than a single comparison in an iteration then max(l, r) grows by 1 for each comparison ⇒ O(log n + m) time
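A sketch of a simpler relative of this acceleration: it maintains l and r as above but, lacking precomputed LCP(L,M) values, starts each comparison at min(l, r) + 1. This is valid because every suffix between L and R shares its first min(l, r) letters with P. It is usually fast in practice but does not guarantee the O(log n + m) bound. Names and 1-based positions are assumptions.

```python
def suffix_array(s):
    """1-based suffix starts in lexicographic order (built by plain sorting)."""
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])


def find_one(text, sa, pattern):
    """Return one 1-based position where pattern occurs in text, or None."""
    n, m = len(text), len(pattern)
    if not sa:
        return None

    def extend(idx, k):
        # Grow a match of k letters between pattern and the suffix sa[idx].
        i = sa[idx] - 1
        while k < m and i + k < n and text[i + k] == pattern[k]:
            k += 1
        return k

    def pattern_is_greater(idx, k):
        # Called only when k < m: the suffix ended or mismatched at offset k.
        i = sa[idx] - 1
        return i + k == n or text[i + k] < pattern[k]

    lo, hi = 0, len(sa) - 1
    l, r = extend(lo, 0), extend(hi, 0)
    if l == m:
        return sa[lo]
    if r == m:
        return sa[hi]
    if not pattern_is_greater(lo, l) or pattern_is_greater(hi, r):
        return None                      # P sorts outside the whole array
    while hi - lo > 1:
        mid = (lo + hi) // 2
        k = extend(mid, min(l, r))       # skip letters already known to match
        if k == m:
            return sa[mid]
        if pattern_is_greater(mid, k):
            lo, l = mid, k               # go right
        else:
            hi, r = mid, k               # go left
    return None


text = "mississippi"
print(find_one(text, suffix_array(text), "issa"))   # None
```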
