
09 SuffixTrees

This document discusses suffix trees, which are compressed tries representing all suffixes of a string or set of strings. Suffix trees allow for efficient exact string matching and finding the longest common substring between strings. While a naive algorithm to build a suffix tree takes O(n^2) time, more sophisticated algorithms can construct one in linear O(n) time. Suffix trees have applications in searching patterns against databases and finding all occurrences of a pattern in a text.


Suffix Trees

Michael T. Goodrich
University of California, Irvine

Most slides adapted from http://www.cs.tau.ac.il/~bchor/CG/suffixtrees.ppt by Haim Kaplan


Trie
A digital tree representing a set of strings.

Example set: { aeef, ad, bbfe, bbfg, c }

[Figure: trie for the example set, one letter per edge.]
Trie (Cont)
Assume no string is a prefix of another.

Each edge is labeled by a letter; no two edges outgoing from the same node are labeled the same.

Each string corresponds to a leaf.

The children of a node can be given in a list or a hash table (indexed by characters from the alphabet).

[Figure: the trie for { aeef, ad, bbfe, bbfg, c }.]
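A minimal Python sketch of such a trie, assuming the hash-table option for the children (a dict keyed by character). The names TrieNode, build_trie and contains are illustrative and do not come from the slides.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # hash table: character -> child TrieNode


def build_trie(strings):
    """Insert each string one letter per edge; assumes a prefix-free set."""
    root = TrieNode()
    for s in strings:
        node = root
        for ch in s:
            # Reuse the edge labeled ch if it exists, otherwise create it.
            node = node.children.setdefault(ch, TrieNode())
    return root


def contains(root, s):
    """True if s spells a complete root-to-leaf path."""
    node = root
    for ch in s:
        if ch not in node.children:
            return False
        node = node.children[ch]
    return not node.children  # in a prefix-free set, strings end at leaves


root = build_trie(["aeef", "ad", "bbfe", "bbfg", "c"])
print(contains(root, "bbfg"))  # True
print(contains(root, "bbf"))   # False: bbf stops at an internal node
```

With a dict, following one edge takes expected constant time; a list of children would instead cost time proportional to the number of children.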
Compressed Trie
Compress unary nodes, label edges by substrings.

[Figure: the trie before and after compression; the unary chains collapse into edges labeled bbf and eef.]

Children of each node can still be indexed by a character from the alphabet (the first one in the substring).
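The compression step can be sketched as follows. This is an illustration only: the trie is represented as nested dicts, and the helper names build_trie and compress are assumptions, not something the slides define.

```python
def build_trie(strings):
    """Plain trie as nested dicts: letter -> subtrie."""
    root = {}
    for s in strings:
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
    return root


def compress(node):
    """Return a compressed copy: edge substring -> compressed subtrie."""
    out = {}
    for ch, child in node.items():
        label = ch
        # Swallow unary nodes (exactly one outgoing edge) into the edge label.
        while len(child) == 1:
            nxt, grandchild = next(iter(child.items()))
            label += nxt
            child = grandchild
        out[label] = compress(child)
    return out


ctrie = compress(build_trie(["aeef", "ad", "bbfe", "bbfg", "c"]))
print(ctrie)
# {'a': {'eef': {}, 'd': {}}, 'bbf': {'e': {}, 'g': {}}, 'c': {}}
```

Edges leaving the same node still start with distinct letters, so a child can be located by the first character of its label.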
Suffix tree
Given a string s, a suffix tree of s is a compressed trie of all suffixes of s.

To make these suffixes prefix-free we add a special character, say $, at the end of s.

Mississippi -> Mississippi$
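A small check of why the terminator is needed, as a sketch (the helper names suffixes and is_prefix_free are made up for this example). Without $, some suffixes are prefixes of others and would end at internal nodes rather than at leaves.

```python
def suffixes(s):
    """All non-empty suffixes of s, longest first."""
    return [s[i:] for i in range(len(s))]


def is_prefix_free(strings):
    """True if no string in the collection is a proper prefix of another."""
    return not any(a != b and b.startswith(a) for a in strings for b in strings)


print(is_prefix_free(suffixes("abab")))          # False: ab is a prefix of abab
print(is_prefix_free(suffixes("abab$")))         # True
print(is_prefix_free(suffixes("Mississippi$")))  # True
```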


Suffix tree (Example)
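Let s=abab. A suffix tree of s is a compressed trie of all suffixes of s=abab$:

{ $, b$, ab$, bab$, abab$ }

[Figure: suffix tree of abab$, with each edge label written as a pair of indices into s, e.g. (0,1) for ab, (2,4) for ab$, (4,4) for $.]

Storing each edge label as an index pair keeps the tree in O(n) space.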
Trivial algorithm to build a Suffix tree

a
Put the largest suffix in b
a
b
$

a b
Put the next largest (bab$) b a
a b
suffix in b $
$
a b
b a
a b
b $
$

Put the third suffix (ab$)


a b
in b a
b
$
a $
b
$
a b
b a
b
$
a $
b
$

Put the next largest suffix (b$) in a b


b
$
a
a $ b
b $
$
a b
b
$
a
a $ b
b $
$

Put the last suffix ($) in $


a b
b
$
a
a $ b
b $
$
$
a b
b
$
a
a $ b
b $
$

We also label each leaf with the starting point of the corresponding
suffix. $
a b 5
b
$
a
a $ b 4
b $
$ 3
2
1
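The insertion steps above can be sketched in Python as follows. This is one possible rendering of the trivial algorithm, not code from the slides: the names Node, build_suffix_tree and leaves are illustrative, and edge labels are stored as half-open index pairs s[start:end] rather than the inclusive pairs used in the figure.

```python
class Node:
    def __init__(self, start=0, end=0, leaf=None):
        self.start, self.end = start, end  # incoming edge label is s[start:end]
        self.children = {}                 # first letter of edge -> child Node
        self.leaf = leaf                   # 1-based suffix start (leaves only)


def build_suffix_tree(s):
    """Naive construction; s must end with a unique terminator such as '$'."""
    root, n = Node(), len(s)
    for i in range(n):                     # suffixes from longest to shortest
        node, j = root, i
        while True:
            child = node.children.get(s[j])
            if child is None:              # no edge starts with s[j]: new leaf
                node.children[s[j]] = Node(j, n, leaf=i + 1)
                break
            k = child.start                # match along the edge label
            while k < child.end and s[k] == s[j]:
                k += 1
                j += 1
            if k == child.end:             # whole edge matched: go deeper
                node = child
                continue
            mid = Node(child.start, k)     # mismatch inside the edge: split it
            node.children[s[child.start]] = mid
            child.start = k
            mid.children[s[k]] = child
            mid.children[s[j]] = Node(j, n, leaf=i + 1)
            break
    return root


def leaves(node):
    """Sorted leaf labels in the subtree rooted at node."""
    if node.leaf is not None:
        return [node.leaf]
    return sorted(x for c in node.children.values() for x in leaves(c))


tree = build_suffix_tree("abab$")
print(sorted(tree.children))  # ['$', 'a', 'b']
print(leaves(tree))           # [1, 2, 3, 4, 5]
```

The unique terminator guarantees that every insertion ends in a mismatch, so the index j never runs past the end of s.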
Analysis
Naively, this takes O(n^2) time to build in the worst case.

More sophisticated algorithms can construct a suffix tree in O(n) time… (to be continued).
The Naïve Algorithm in Practice
The naïve construction algorithm is not usually as bad as O(n^2) time in practice.
A worst-case example is a^(n-1)b$. This is rare.
For example, for a random string, the naïve algorithm runs in O(n log n) expected time.
Why?
What can we do with it ?
Exact string matching:
Given a text T, |T| = n, preprocess it such that when a pattern P, |P| = m, arrives you can quickly decide whether it occurs in T.

We may also want to find all occurrences of P in T.
Exact string matching
In preprocessing we just build a suffix tree in O(n) time.

[Figure: suffix tree of abab$ with leaves labeled 1 to 5.]

Given a pattern P = ab we traverse the tree according to the pattern.

If we did not get stuck traversing the pattern then the pattern occurs in the text.

Each leaf in the subtree below the node we reach corresponds to an occurrence.

By traversing this subtree we get all k occurrences in O(m+k) time.
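A sketch of the search in Python. For self-containment it builds the tree with the naive quadratic algorithm rather than a linear-time one, and the names Node, build_suffix_tree and find_occurrences are assumptions made for this example.

```python
class Node:
    def __init__(self, start=0, end=0, leaf=None):
        self.start, self.end = start, end  # incoming edge label is s[start:end]
        self.children = {}                 # first letter of edge -> child Node
        self.leaf = leaf                   # 1-based suffix start (leaves only)


def build_suffix_tree(s):
    """Naive O(n^2) construction; s must end with a unique terminator."""
    root, n = Node(), len(s)
    for i in range(n):
        node, j = root, i
        while True:
            child = node.children.get(s[j])
            if child is None:
                node.children[s[j]] = Node(j, n, leaf=i + 1)
                break
            k = child.start
            while k < child.end and s[k] == s[j]:
                k, j = k + 1, j + 1
            if k == child.end:
                node = child
                continue
            mid = Node(child.start, k)     # split the edge at the mismatch
            node.children[s[child.start]] = mid
            child.start = k
            mid.children[s[k]] = child
            mid.children[s[j]] = Node(j, n, leaf=i + 1)
            break
    return root


def find_occurrences(s, root, p):
    """1-based start positions of pattern p in s (empty list if none)."""
    node, j = root, 0
    while j < len(p):                      # traverse the tree according to p
        child = node.children.get(p[j])
        if child is None:
            return []                      # stuck at a node
        k = child.start
        while k < child.end and j < len(p):
            if s[k] != p[j]:
                return []                  # stuck inside an edge
            k, j = k + 1, j + 1
        node = child
    found, stack = [], [node]              # every leaf below is an occurrence
    while stack:
        v = stack.pop()
        if v.leaf is not None:
            found.append(v.leaf)
        stack.extend(v.children.values())
    return sorted(found)                   # sorted only for readable output


text = "abab$"
tree = build_suffix_tree(text)
print(find_occurrences(text, tree, "ab"))  # [1, 3]
print(find_occurrences(text, tree, "aa"))  # []
```

The walk down costs O(m) and the leaf collection O(k); the final sort is a convenience for printing and is not part of the O(m+k) bound.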
So what can we do with it ?
Matching a pattern against a database of strings

1. Construct a suffix tree for the text
2. Search for each pattern in the suffix tree
Generalized suffix tree
Given a set of strings S, a generalized suffix tree of S is a compressed trie of all suffixes of every s ∈ S.

To make these suffixes prefix-free we add a special char, say $, at the end of s.

To associate each suffix with a unique string in S, add a different special char to each s.
Generalized suffix tree (Example)
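Let s1=abab and s2=aab. Here is a generalized suffix tree for s1 and s2:

{ $, b$, ab$, bab$, abab$ }  (suffixes of s1$)
{ #, b#, ab#, aab# }  (suffixes of s2#)

[Figure: generalized suffix tree of abab$ and aab#; leaves are labeled with the starting point of the suffix, 1 to 5 for s1 and 1 to 4 for s2.]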
Longest common substring (of two strings)
Every node with a leaf descendant from string s1 and a leaf descendant from string s2 represents a maximal common substring and vice versa.

Find such node with largest “string depth”.

[Figure: the generalized suffix tree of abab$ and aab# from the previous example.]
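The idea can be sketched in Python. Note the simplification: this example uses an uncompressed generalized suffix trie instead of a compressed suffix tree, so it takes at least quadratic time and space and is only meant to show the "node reached from both strings" criterion. Because every node records which strings pass through it, the terminators $ and # are not needed here. The function name and the dict layout are illustrative.

```python
def longest_common_substring(s1, s2):
    """Longest common substring via an uncompressed generalized suffix trie."""
    root = {"kids": {}, "owners": set()}
    for owner, s in enumerate((s1, s2)):
        for i in range(len(s)):            # insert every suffix of s
            node = root
            for ch in s[i:]:
                node = node["kids"].setdefault(ch, {"kids": {}, "owners": set()})
                node["owners"].add(owner)  # a suffix of string `owner` passes here
    # Search for the deepest node that has suffixes of both strings below it.
    best, stack = "", [(root, "")]
    while stack:
        node, path = stack.pop()
        if len(path) > len(best):
            best = path
        for ch, child in node["kids"].items():
            if len(child["owners"]) == 2:  # shared by s1 and s2
                stack.append((child, path + ch))
    return best


print(longest_common_substring("abab", "aab"))  # ab
```

In this trie the string depth of a node is simply its depth, which is why the search tracks the path spelled so far.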
Longest Substring that is a Palindrome

Let s = cbaaba$, then s^r = abaabc#.

[Figure: generalized suffix tree of cbaaba$ and abaabc#, with leaves labeled 1 to 7 for each string.]
