Tutorial Suffix Tree
Text Indexing
String Matching problem: Given a text T and a pattern P, how do we locate all occurrences of P in T?
The KMP algorithm solves this in O(|T|+|P|) time, which is optimal.
In some applications, T is very long and given in advance, and we will search for different patterns in it later.
E.g., T = human DNA, P = a gene.
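As a point of reference for the O(|T|+|P|) bound, here is a minimal sketch of KMP in Python. The function name `kmp_search` is illustrative and not from the tutorial, and the returned positions are 0-based, whereas the slides index T from 1.

```python
def kmp_search(text, pattern):
    """Return the 0-based start positions of all occurrences of pattern in text."""
    if not pattern:
        return []

    # failure[i] = length of the longest proper prefix of pattern[:i+1]
    # that is also a suffix of pattern[:i+1].
    failure = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = failure[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k

    # Scan the text once; k is the number of pattern characters matched so far.
    matches = []
    k = 0
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = failure[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = failure[k - 1]  # keep going so overlapping matches are found
    return matches


print(kmp_search("acacaac#", "ac"))
```

Every query rescans all of T, which is the cost an index is meant to avoid when T is fixed and many patterns arrive later.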
Text Indexing
Text Indexing problem: Suppose a text T is known in advance. Can we build a data structure for T such that, for any pattern P given later, we can find all occurrences of P in T quickly?
Such a data structure is called an index of T.
Target: can we search faster than O(|T|+|P|) time?
Text Indexing
Two main kinds of text indexes:
Word-Based (for texts formed by words): used by most text search engines. E.g., Inverted Files.
Full-Text (for texts with no word boundaries): used in indexing DNA. E.g., Suffix Tree, Suffix Array.
Suffix Tree
Let T[1..n] be a text with n characters. We assume T[n] is a unique character, i.e., one that appears nowhere else in T.
For any j, T[j..n] is called a suffix of T, so T has exactly n suffixes.
Weiner (1973) and McCreight (1976) independently invented the suffix tree: a tree formed by putting all suffixes of T together.
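To make "putting all suffixes of T together" concrete, the sketch below inserts every suffix into a plain, uncompacted trie built from nested dictionaries. This is a naive quadratic-time illustration, not the construction of Weiner or McCreight, and a real suffix tree additionally merges each chain of single-child nodes into one edge. The names `build_suffix_trie`, `occurrences`, and `LEAF` are assumptions made for this example.

```python
LEAF = "leaf"  # multi-character key, so it cannot collide with a single text character


def build_suffix_trie(text):
    """Insert every suffix text[j..n] into a trie of nested dicts.

    Each suffix ends at its own leaf, which stores the 1-based start
    position j. This relies on the last character of text being unique.
    """
    root = {}
    for j in range(len(text)):
        node = root
        for ch in text[j:]:
            node = node.setdefault(ch, {})
        node[LEAF] = j + 1
    return root


def occurrences(trie, pattern):
    """Return the sorted 1-based positions where pattern occurs in the text."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return []
        node = node[ch]
    # Every leaf below this node is a suffix that starts with pattern.
    found, stack = [], [node]
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key == LEAF:
                found.append(child)
            else:
                stack.append(child)
    return sorted(found)


trie = build_suffix_trie("acacaac#")
print(occurrences(trie, "ac"))
```

Walking down from the root takes time proportional to |P|, independent of |T|; only collecting the leaves depends on the number of occurrences.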
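[Figure: suffix tree of T = acacaac#. Each leaf is numbered 1 to 8 by the starting position of its suffix; for example, the leaf for the suffix # is numbered 8 and the leaf for the whole text acacaac# is numbered 1.]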
Space Usage
There are O(n) nodes and O(n) edges in the suffix tree. Does that mean O(n) space?
Each edge needs to store its label, which can contain O(n) characters.
In the worst case, the labels total O(n^2) characters.
Can we reduce the space usage?
Space Usage
Observation: Each edge label is equal to some substring of T.
Clever idea:
1. Store T, and
2. Replace each edge label by 2 integers that tell which substring of T it is equal to.
Total space: O(n).
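A small sketch of the two-integer representation, assuming 1-based inclusive positions as in the slides. The helper name `edge_label` is hypothetical. For T = acacaac#, the pair [4,8] stands for the substring T[4..8].

```python
T = "acacaac#"


def edge_label(text, start, end):
    """Recover an edge label from the 1-based, inclusive pair [start, end]."""
    return text[start - 1:end]


# The edge stores only two integers, however long its label is.
edge = (4, 8)
print(edge_label(T, *edge))
```

Since each of the O(n) edges now costs two integers, the tree plus one copy of T fits in O(n) space.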
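[Figure: the suffix tree with each edge label replaced by a pair of positions in T; e.g., the edge into leaf 1 is labeled [4,8].]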
Suffix Array
Although the suffix tree takes O(n) space, the hidden constant is quite large: around 40n to 60n bytes.
Manber and Myers (1990) simplified the suffix tree and invented the suffix array: an array storing the suffixes of T in dictionary order.
Suffix Array
Suffix Array of acacaac#

j    jth smallest suffix
1    #
2    aac#
3    ac#
4    acaac#
5    acacaac#
6    c#
7    caac#
8    cacaac#
The suffix array SA for T has n entries.
For any j, SA[j] stores the jth smallest suffix in alphabetical order.
Theorem: If P occurs in T, its occurrences correspond to a consecutive region in SA, because all suffixes that begin with P are adjacent in sorted order.
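The sketch below builds the suffix array of the example text by sorting the suffixes directly. This is a naive illustration and not the construction algorithm of Manber and Myers. Each suffix is represented by its 1-based starting position, and the function name `build_suffix_array` is an assumption. It relies on # comparing smaller than the letters, which holds for Python's default string order.

```python
def build_suffix_array(text):
    """Return the 1-based start positions of the suffixes in sorted order."""
    n = len(text)
    return sorted(range(1, n + 1), key=lambda j: text[j - 1:])


T = "acacaac#"
SA = build_suffix_array(T)
print(SA)

# Ranks whose suffix starts with "ac": they form one consecutive block.
ranks = [r + 1 for r, j in enumerate(SA) if T[j - 1:].startswith("ac")]
print(ranks)
```

For this text the suffixes starting with "ac" sit at ranks 3, 4 and 5, which is the consecutive region the theorem describes.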
Suffix Array
Searching for P takes O(|P| log n) time using binary search.
Space: We can represent each suffix by its starting position, so the array takes O(n) space.
In practice, around 14n bytes.
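A minimal sketch of the binary search, assuming the suffix array stores 1-based starting positions. Two binary searches locate the two ends of the consecutive region of suffixes that begin with P; each step compares at most |P| characters, which gives the O(|P| log n) bound. The name `sa_search` is illustrative, and the array is built here by naive sorting only to keep the example self-contained.

```python
def sa_search(text, sa, pattern):
    """Return the sorted 1-based positions of pattern in text, using suffix array sa."""
    m = len(pattern)

    def prefix(rank):
        # First m characters of the suffix at this (0-based) rank.
        start = sa[rank] - 1
        return text[start:start + m]

    # Lower end: first rank whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper end: first rank whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[first:lo])


TEXT = "acacaac#"
# Naive construction, for illustration only.
SUFFIX_ARRAY = sorted(range(1, len(TEXT) + 1), key=lambda j: TEXT[j - 1:])
print(sa_search(TEXT, SUFFIX_ARRAY, "ac"))
```

Only T and the array of n integers are kept in memory, which is where the space saving over the suffix tree comes from.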