Lecture03_SuffixTree
Advanced Data Structures Tries and Suffix Trees Dr. Amin Allam
[For more details, refer to “Jewels of Stringology” by Maxime Crochemore and Wojciech Rytter]
1 Tries
The trie data structure is a tree that stores a set of short strings (the dataset) and supports searching
for (retrieving) a given query string in that dataset. The following trie stores 5 strings:
0) AGA
1) AG
2) AAG
3) GAAG
4) TCG
[Figure: trie storing the five strings above; each edge is labelled with one character, and squares at nodes hold the string IDs.]
Insertions and retrievals start from the root. Each edge is labelled with one character. Edges from
a node to its children must be labelled with different characters. The ID of a dataset string is
contained in the node such that the path from the root to that node is labelled with that string (as
shown in the squares in the above figure).
Each insertion or retrieval traverses at most m edges, where m is the string length, and thus
costs O(m) time, assuming that moving from a node to the correct child for a given edge label
takes O(1) time.
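The insertion and retrieval walk described above can be sketched as follows. This is a minimal illustration, not the lecture's implementation: the names `TrieNode`, `Trie`, `insert`, `retrieve` and `build_example` are hypothetical, the alphabet is assumed to be ACGT, each node is assumed to keep a fixed-size child array, and the five dataset strings are read as AGA, AG, AAG, GAAG and TCG.

```python
ALPHABET = "ACGT"  # assumed alphabet for this sketch
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


class TrieNode:
    def __init__(self):
        # One slot per alphabet character; None means "no such child".
        self.children = [None] * len(ALPHABET)
        # ID of the dataset string that ends at this node, if any.
        self.string_id = None


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, s, string_id):
        """Walk from the root, creating missing nodes: O(len(s)) steps."""
        node = self.root
        for ch in s:
            i = INDEX[ch]
            if node.children[i] is None:
                node.children[i] = TrieNode()
            node = node.children[i]
        node.string_id = string_id

    def retrieve(self, s):
        """Return the ID stored for s, or None if s is not in the dataset."""
        node = self.root
        for ch in s:
            i = INDEX.get(ch)
            if i is None:
                return None  # character outside the assumed alphabet
            node = node.children[i]
            if node is None:
                return None  # path does not exist
        return node.string_id


def build_example():
    trie = Trie()
    for string_id, s in enumerate(["AGA", "AG", "AAG", "GAAG", "TCG"]):
        trie.insert(s, string_id)
    return trie
```

Note that AG ends at an internal node of the path for AGA, so a node may carry an ID and still have children.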
Suppose that the alphabet size (number of possible different characters) is |Σ|. A trie can be
implemented using one of the following methods:
• Each node contains an array of length |Σ| whose ith element holds a pointer to the child
reached via the ith character of the alphabet. Moving to a child takes O(1) time, and each node takes O(|Σ|) space.
• Each node contains a linked list in which each element holds a character and a child node
pointer. Moving to a child takes O(|Σ|) time, and each list element takes O(1) space.
• Each node contains a red-black tree in which each element holds a character as the key and a
child node pointer. Moving to a child takes O(log(|Σ|)) time, and each tree element takes O(1) space.
• One hash table serves the whole trie; each element holds a character and two node pointers:
parent and child. The hash function is a function of the parent node pointer and the character.
Moving to a child takes O(1) expected time, and each element takes O(1) space, but this method suffers from cache misses.
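As an illustration of the last method, here is a minimal sketch that keeps all edges of the trie in one hash table. It is only one possible reading of the idea: the class name `HashTrie` is hypothetical, nodes are represented by integers instead of pointers, and a Python dict keyed on (parent node number, character) stands in for the hash table.

```python
class HashTrie:
    def __init__(self):
        # (parent node number, character) -> child node number
        self.edges = {}
        # node number -> ID of the dataset string ending at that node
        self.string_ids = {}
        # Node 0 is the root; new nodes get the next free number.
        self.node_count = 1

    def insert(self, s, string_id):
        node = 0
        for ch in s:
            key = (node, ch)
            if key not in self.edges:
                self.edges[key] = self.node_count
                self.node_count += 1
            node = self.edges[key]
        self.string_ids[node] = string_id

    def retrieve(self, s):
        node = 0
        for ch in s:
            node = self.edges.get((node, ch))
            if node is None:
                return None  # no edge with this label from the current node
        return self.string_ids.get(node)
```

Each character costs one dictionary lookup, and no per-node child container is needed, which is the trade-off the bullet above describes.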
FCAI-CU AdvDS Tries and Suffix Trees Amin Allam
2 Suffix tries
The suffix trie data structure is a trie that stores all suffixes of a given large string of length n. A
suffix of a string is a substring that ends at the last location (n − 1). The ID of a suffix is its starting
location in the original string. The suffix trie supports searching for a given substring in the
original string. The following suffix trie stores all suffixes of the string banana:
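012345
banana
[Figure: suffix trie of banana; each edge is labelled with one character, and each node where a suffix ends holds that suffix's ID (0 to 5).]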
The suffix trie requires O(n²) space and construction time, which makes it impractical. To make it
practical, nodes with one child should be removed. Before doing that, a sentinel $ is appended to the
original string to make sure that no suffix ends at an internal node of the suffix trie:
0123456
banana$
[Figure: suffix trie of banana$; every suffix now ends at a leaf, and each leaf holds its suffix ID (0 to 6).]
To search for a substring, the suffix trie is traversed from the root along the characters of the
substring. The IDs associated with all nodes in the subtree of the reached node are reported as the
locations of that substring. For example, searching for an or ana returns {3,1}, while searching for a returns {5,3,1}.
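The search just described can be sketched as below. This is a minimal illustration under stated assumptions: the function names `build_suffix_trie` and `find_occurrences` are hypothetical, nodes are plain dicts, and the trie is built naively by inserting every suffix of banana$, which takes quadratic time and space.

```python
def build_suffix_trie(text):
    """Insert every suffix of text character by character (O(n^2) nodes)."""
    root = {"children": {}, "suffix_id": None}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node["children"].setdefault(
                ch, {"children": {}, "suffix_id": None}
            )
        node["suffix_id"] = start  # the suffix starting at `start` ends here
    return root


def find_occurrences(root, pattern):
    """Walk down along pattern, then report all suffix IDs in the subtree."""
    node = root
    for ch in pattern:
        node = node["children"].get(ch)
        if node is None:
            return []  # pattern is not a substring
    found = []
    stack = [node]
    while stack:
        cur = stack.pop()
        if cur["suffix_id"] is not None:
            found.append(cur["suffix_id"])
        stack.extend(cur["children"].values())
    return sorted(found)


trie = build_suffix_trie("banana$")
print(find_occurrences(trie, "an"))   # [1, 3]
print(find_occurrences(trie, "ana"))  # [1, 3]
print(find_occurrences(trie, "a"))    # [1, 3, 5]
```

The results are sorted only to make them easy to compare with the sets in the text.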
Now, one-child nodes can be safely removed to obtain a suffix tree. Also, since all suffixes end
at leaves, suffix IDs need not be stored: the ID of a leaf can be deduced during a query by
subtracting the number of characters on the path from the root to that leaf from n (the string
length including $).
3 Suffix trees
A suffix tree is a compact suffix trie which contains all suffixes of an original string of length n
(including $), does not contain any one-child node, and all suffixes end at leaves. Consider the
following suffix tree of the string banana$:
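0123456
banana$
[Figure: suffix tree of banana$; each edge is labelled [st,len], the start location and length of its substring in the original string, for example [0,7] for banana$, [1,1] for a, [2,2] for na, [4,3] for na$ and [6,1] for $. Leaves correspond to suffix IDs 0 to 6.]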
After one-child nodes are removed, some edges need to be labelled with substrings, not with single
characters as in the suffix trie. To avoid O(n²) space, edges are labelled with the start location and
the length of a substring inside the original string, instead of labelling them with the substrings
themselves. Thus, the original string must be available to conduct queries. Substrings are shown
on edges in the above figure only for illustration. The substring length can also be removed and
deduced by subtracting the start location of a node's edge from the smallest start location among
the edges to its children (for a leaf, from n).
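The [st,len] labelling can be checked with a few lines of code. This is a small sketch, not part of the lecture: the helper names `edge_label` and `deduced_length` are hypothetical, the labels are the ones from the banana$ figure, and the length rule is read as "smallest child start minus the node's own start, or n minus the node's own start for a leaf".

```python
TEXT = "banana$"


def edge_label(text, st, length):
    """Recover the substring that an edge labelled [st, length] stands for."""
    return text[st:st + length]


def deduced_length(st, child_starts, n):
    """Deduce an edge length from start locations only (assumed reading).

    child_starts holds the start locations of the edges to the children
    of the node this edge leads to; it is empty for a leaf.
    """
    end = min(child_starts) if child_starts else n
    return end - st


for st, length in [(0, 7), (1, 1), (2, 2), (4, 3), (6, 1)]:
    print([st, length], "->", edge_label(TEXT, st, length))
```

For the edge [2,2] (na), whose children are reached by the edges [6,1] and [4,3], the deduced length is min(6, 4) − 2 = 2, which agrees with the stored length.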
Thus, each node in the suffix tree needs O(1) space, and the number of leaves equals the number
of suffixes n. The number of internal nodes is ≤ n − 1 (∗). Thus, the suffix tree needs O(n) space.
(∗) The number of internal nodes of a tree with no one-child nodes is ≤ the number of leaves − 1.
Proof: Consider a procedure which starts with the leaves and constructs an arbitrary tree by
repeatedly picking at least two nodes that have no parent and creating a new internal node as their
parent. Each step removes at least two nodes from the set of parentless nodes and adds one, so the
number of nodes with no parent decreases by at least one. The procedure stops when there is exactly
one node with no parent (which is the root). Since it starts with as many parentless nodes as there
are leaves, the number of steps, as well as the number of created internal nodes, cannot exceed the
number of leaves − 1.
To construct a suffix tree, we should not create an O(n²) suffix trie then use it to construct the suffix
tree, because O(n²) space or time is not available for large strings. Ukkonen proposed a practical
algorithm to construct the O(n)-space suffix tree using only O(n) space and time.
The time complexity of searching for a substring inside the suffix tree is O(m + occ), where m is the
length of the substring and occ is the number of occurrences of that substring inside the original
string. That result follows because O(m) is needed for the initial traversal, which stops either at a
node or in the middle of an edge, then O(occ) is needed for a depth-first search of the subtree below
that point: the subtree has occ leaves and, having no one-child nodes, fewer than occ internal nodes.
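The query procedure can be sketched end to end as follows. To keep the example short, the tree is built by a naive O(n²) insertion of every suffix, not by Ukkonen's algorithm; the names `Node`, `build_suffix_tree`, `find_occurrences` and `count_nodes` are hypothetical, and the text is assumed to end with a sentinel that occurs nowhere else.

```python
class Node:
    def __init__(self, st, length):
        self.st = st          # start location of the edge label entering this node
        self.length = length  # length of that edge label
        self.children = {}    # first character of a child edge -> child Node


def build_suffix_tree(text):
    """Naive construction (NOT Ukkonen): insert each suffix, splitting edges."""
    assert text and text.count(text[-1]) == 1, "text must end with a unique sentinel"
    n = len(text)
    root = Node(0, 0)
    for i in range(n):
        node, pos = root, i  # text[i:pos] has been matched so far
        while True:
            child = node.children.get(text[pos])
            if child is None:
                node.children[text[pos]] = Node(pos, n - pos)  # new leaf
                break
            k = 0  # number of characters matched along the child's edge
            while k < child.length and text[child.st + k] == text[pos + k]:
                k += 1
            if k == child.length:  # whole edge matched: descend
                node, pos = child, pos + k
                continue
            # Mismatch inside the edge: split it with a new internal node.
            mid = Node(child.st, k)
            node.children[text[pos]] = mid
            child.st += k
            child.length -= k
            mid.children[text[child.st]] = child
            mid.children[text[pos + k]] = Node(pos + k, n - pos - k)
            break
    return root


def find_occurrences(root, text, pattern):
    """O(m) descent, then a DFS over the leaves below the stopping point."""
    n = len(text)
    node, depth, j = root, 0, 0  # j = pattern characters matched so far
    while j < len(pattern):
        child = node.children.get(pattern[j])
        if child is None:
            return []
        label = text[child.st:child.st + child.length]
        k = min(len(label), len(pattern) - j)
        if label[:k] != pattern[j:j + k]:
            return []
        j += k
        node, depth = child, depth + child.length
    found = []
    stack = [(node, depth)]
    while stack:
        cur, d = stack.pop()
        if not cur.children:
            found.append(n - d)  # suffix ID = n minus characters on the path
        for c in cur.children.values():
            stack.append((c, d + c.length))
    return sorted(found)


def count_nodes(root):
    """Return (number of leaves, number of internal nodes including the root)."""
    leaves = internal = 0
    stack = [root]
    while stack:
        cur = stack.pop()
        if cur.children:
            internal += 1
            stack.extend(cur.children.values())
        else:
            leaves += 1
    return leaves, internal
```

For banana$ this sketch produces 7 leaves and 4 internal nodes (counting the root), consistent with the bound of n − 1 internal nodes, and suffix IDs are never stored: they are recovered from the path length at each leaf.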