Chapter 10: Sequence Mining

This document summarizes key concepts in sequence mining. It defines basic terminology such as sequences, subsequences, substrings, and support, and then describes algorithms for mining frequent sequences from sequence databases: GSP, a level-wise (breadth-first) search algorithm; SPADE, a vertical mining approach that uses sequential joins on position lists; and PrefixSpan, a projection-based depth-first method. It also covers frequent substring mining via suffix trees, including Ukkonen's linear-time construction algorithm. Pseudocode is provided for each algorithm.


Data Mining and Machine Learning:

Fundamental Concepts and Algorithms


dataminingbook.info

Mohammed J. Zaki¹  Wagner Meira Jr.²

¹ Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY, USA
² Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil

Chapter 10: Sequence Mining

Sequence Mining: Terminology

Let Σ be the alphabet, a set of symbols. A sequence or a string is defined as an ordered list of symbols, written s = s_1 s_2 … s_k, where s_i ∈ Σ is the symbol at position i, also denoted s[i]. The length of the sequence is |s| = k.

The notation s[i : j] = s_i s_{i+1} … s_{j−1} s_j denotes the substring or sequence of consecutive symbols in positions i through j, where j > i.

Define the prefix of a sequence s as any substring of the form s[1 : i] = s_1 s_2 … s_i, with 0 ≤ i ≤ n, and the suffix of s as any substring of the form s[i : n] = s_i s_{i+1} … s_n, with 1 ≤ i ≤ n + 1. Here s[1 : 0] is the empty prefix, and s[n + 1 : n] is the empty suffix.

Let Σ⋆ be the set of all possible sequences that can be constructed using the symbols in Σ, including the empty sequence ∅ (which has length zero).

Sequence Mining: Terminology

Let s = s_1 s_2 … s_n and r = r_1 r_2 … r_m be two sequences over Σ. We say that r is a subsequence of s, denoted r ⊆ s, if there exists a one-to-one mapping φ : [1, m] → [1, n] such that r[i] = s[φ(i)] and, for any two positions i, j in r, i < j ⟹ φ(i) < φ(j). If r ⊆ s, we also say that s contains r.

The sequence r is called a consecutive subsequence or substring of s provided r_1 r_2 … r_m = s_j s_{j+1} … s_{j+m−1}, that is, r[1 : m] = s[j : j + m − 1], with 1 ≤ j ≤ n − m + 1.

For example, let Σ = {A, C, G, T} and s = ACTGAACG. Then r_1 = CGAAG is a subsequence of s, and r_2 = CTGA is a substring of s. The sequence r_3 = ACT is a prefix of s, and so is r_4 = ACTGA, whereas r_5 = GAACG is one of the suffixes of s.
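These containment tests translate directly into code. The following Python sketch (illustrative; the helper names are my own, not from the book) checks subsequence and substring containment on the example above.

def is_subsequence(r: str, s: str) -> bool:
    """True if r ⊆ s: the symbols of r appear in s in the same order (gaps allowed)."""
    pos = 0
    for sym in r:
        pos = s.find(sym, pos) + 1   # locate sym at or after index pos
        if pos == 0:                 # find() returned -1: sym not found in order
            return False
    return True

def is_substring(r: str, s: str) -> bool:
    """True if r appears as consecutive symbols of s."""
    return r in s

s = "ACTGAACG"
print(is_subsequence("CGAAG", s))   # True  (r_1)
print(is_substring("CTGA", s))      # True  (r_2)
print(is_subsequence("GTC", s))     # False (symbols present, but not in this order)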

Frequent Sequences

Given a database D = {s_1, s_2, …, s_N} of N sequences, and given some sequence r, the support of r in the database D is defined as the total number of sequences in D that contain r:

    sup(r) = |{ s_i ∈ D : r ⊆ s_i }|

The relative support of r is the fraction of sequences that contain r:

    rsup(r) = sup(r) / N

Given a user-specified minsup threshold, we say that a sequence r is frequent in database D if sup(r) ≥ minsup. A frequent sequence is maximal if it is not a subsequence of any other frequent sequence, and a frequent sequence is closed if it is not a subsequence of any other frequent sequence with the same support.
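As a small, self-contained illustration (my own sketch, not the book's code), support and relative support can be computed by scanning the database with the subsequence test from the previous example:

def support(r: str, db: list) -> int:
    """Number of database sequences that contain r as a subsequence."""
    def contains(s: str) -> bool:
        pos = 0
        for sym in r:
            pos = s.find(sym, pos) + 1
            if pos == 0:
                return False
        return True
    return sum(contains(s) for s in db)

db = ["CAGAAGT", "TGACAG", "GAAGT"]    # the example database used on the next slides
print(support("GAAG", db))             # sup(GAAG)  = 3
print(support("GAAG", db) / len(db))   # rsup(GAAG) = 1.0
print(support("C", db))                # sup(C) = 2, so C is infrequent for minsup = 3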

Mining Frequent Sequences

For sequence mining the order of the symbols matters, and thus we have to
consider all possible permutations of the symbols as the possible frequent
candidates. Contrast this with itemset mining, where we had only to consider
combinations of the items.
The sequence search space can be organized in a prefix search tree. The root of the tree, at level 0, contains the empty sequence, with each symbol x ∈ Σ as one of its children. As such, a node labeled with the sequence s = s_1 s_2 … s_k at level k has children of the form s' = s_1 s_2 … s_k s_{k+1} at level k + 1. In other words, s is a prefix of each child s', which is also called an extension of s.

Example Sequence Database

Id Sequence
s1 CAGAAGT
s2 TGACAG
s3 GAAGT

Using minsup = 3, the set of frequent subsequences is given as:

F(1) = {A(3), G(3), T(3)}
F(2) = {AA(3), AG(3), GA(3), GG(3)}
F(3) = {AAG(3), GAA(3), GAG(3)}
F(4) = {GAAG(3)}

Level-wise Sequence Mining: GSP Algorithm

The GSP algorithm searches the sequence prefix tree using a level-wise or breadth-first
search. Given the set of frequent sequences at level k, we generate all possible sequence
extensions or candidates at level k + 1. We next compute the support of each candidate
and prune those that are not frequent. The search stops when no more frequent
extensions are possible.
The prefix search tree at level k is denoted C^(k). Initially C^(1) comprises all the symbols in Σ. Given the current set of candidate k-sequences C^(k), the method first computes their support. For each database sequence s_i ∈ D, we check whether a candidate sequence r ∈ C^(k) is a subsequence of s_i. If so, we increment the support of r. Once the frequent sequences at level k have been found, we generate the candidates for level k + 1.

For the extension, each leaf r_a is extended with the last symbol of any other leaf r_b that shares the same prefix (i.e., has the same parent), to obtain the new candidate (k + 1)-sequence r_ab = r_a + r_b[k]. If the new candidate r_ab contains any infrequent k-sequence, we prune it.

Algorithm GSP

GSP (D, Σ, minsup):
1  F ← ∅
2  C^(1) ← {∅} // initial prefix tree with single symbols
3  foreach s ∈ Σ do Add s as child of ∅ in C^(1) with sup(s) ← 0
4  k ← 1 // k denotes the level
5  while C^(k) ≠ ∅ do
6      ComputeSupport (C^(k), D)
7      foreach leaf r ∈ C^(k) do
8          if sup(r) ≥ minsup then F ← F ∪ {(r, sup(r))}
9          else remove r from C^(k)
10     C^(k+1) ← ExtendPrefixTree (C^(k))
11     k ← k + 1
12 return F

Algorithm ComputeSupport
ComputeSupport (C^(k), D):
1  foreach s_i ∈ D do
2      foreach r ∈ C^(k) do
3          if r ⊆ s_i then sup(r) ← sup(r) + 1

ExtendPrefixTree (C^(k)):
1  foreach leaf r_a ∈ C^(k) do
2      foreach leaf r_b ∈ Children(Parent(r_a)) do
3          r_ab ← r_a + r_b[k] // extend r_a with the last symbol of r_b
           // prune if r_ab has any infrequent k-subsequence
4          if r_c ∈ C^(k) for all r_c ⊂ r_ab such that |r_c| = |r_ab| − 1 then
5              Add r_ab as child of r_a with sup(r_ab) ← 0
6      if no extensions from r_a then
7          remove r_a, and all ancestors of r_a with no extensions, from C^(k)
8  return C^(k)
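The following Python sketch (my own simplified rendering, not the book's code) follows the same level-wise logic, but represents each level as a flat dictionary of frequent sequences instead of an explicit prefix tree: candidates are generated by joining frequent k-sequences that share a (k−1)-length prefix, pruned if any k-subsequence is infrequent, and then counted in one scan of the database.

def is_subsequence(r, s):
    pos = 0
    for sym in r:
        pos = s.find(sym, pos) + 1
        if pos == 0:
            return False
    return True

def gsp(db, alphabet, minsup):
    """Level-wise (breadth-first) frequent-sequence mining in the spirit of GSP."""
    freq = {}                              # all frequent sequences -> support
    candidates = list(alphabet)            # level k = 1: the single symbols
    while candidates:
        counts = {c: sum(is_subsequence(c, s) for s in db) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= minsup}
        freq.update(level)
        # level k+1: join frequent k-sequences sharing a (k-1)-prefix, then prune
        # any candidate that has an infrequent k-subsequence
        nxt = []
        for ra in level:
            for rb in level:
                if ra[:-1] == rb[:-1]:                 # same parent in the prefix tree
                    rab = ra + rb[-1]
                    subs = (rab[:i] + rab[i + 1:] for i in range(len(rab)))
                    if all(sub in level for sub in subs):
                        nxt.append(rab)
        candidates = nxt
    return freq

db = ["CAGAAGT", "TGACAG", "GAAGT"]
print(gsp(db, alphabet="ACGT", minsup=3))
# frequent: A, G, T, AA, AG, GA, GG, AAG, GAA, GAG, GAAG (support 3 each),
# reproducing F(1) through F(4) from the earlier slide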

Sequence Search Space
(In the original figure, shaded ovals denote infrequent sequences.)

Level 0: ∅(3)
Level 1: A(3), C(2), G(3), T(3)
Level 2: AA(3), AG(3), AT(2), GA(3), GG(3), GT(2), TA(1), TG(1), TT(0)
Level 3: AAA(1), AAG(3), AGA(1), AGG(1), GAA(3), GAG(3), GGA(0), GGG(0)
Level 4: AAGG, GAAA, GAAG(3), GAGA, GAGG (candidates shown without a count are pruned before support counting)

Vertical Sequence Mining: Spade

The Spade algorithm uses a vertical database representation for sequence mining. For each symbol s ∈ Σ, we keep a set of tuples of the form ⟨i, pos(s)⟩, where pos(s) is the set of positions in the database sequence s_i ∈ D at which symbol s appears.

Let L(s) denote the set of such sequence-position tuples for symbol s, which we refer to as the poslist. The set of poslists for each symbol s ∈ Σ thus constitutes a vertical representation of the input database.

Given a k-sequence r, its poslist L(r) maintains the list of positions for the occurrences of the last symbol r[k] in each database sequence s_i, provided r ⊆ s_i. The support of sequence r is simply the number of distinct sequences in which r occurs, that is, sup(r) = |L(r)|.
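As an illustration (my own sketch, not the book's code), the single-symbol poslists for the running example can be built in one pass over the database; sequence ids and positions are 1-based to match the slides.

from collections import defaultdict

def build_poslists(db):
    """Vertical representation: symbol -> {sequence id -> sorted positions}."""
    poslists = defaultdict(lambda: defaultdict(list))
    for i, seq in enumerate(db, start=1):          # 1-based sequence ids
        for p, sym in enumerate(seq, start=1):     # 1-based positions
            poslists[sym][i].append(p)
    return {sym: dict(per_seq) for sym, per_seq in poslists.items()}

db = ["CAGAAGT", "TGACAG", "GAAGT"]
L = build_poslists(db)
print(L["A"])       # {1: [2, 4, 5], 2: [3, 5], 3: [2, 3]}
print(L["G"])       # {1: [3, 6], 2: [2, 6], 3: [1, 4]}
print(len(L["C"]))  # sup(C) = 2: the number of sequences in which C occurs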

Spade Algorithm

Support computation in Spade is done via sequential join operations. Given the poslists for any two k-sequences r_a and r_b that share the same (k − 1)-length prefix, a sequential join on the poslists is used to compute the support for the new (k + 1)-length candidate sequence r_ab = r_a + r_b[k].

Given a tuple ⟨i, pos(r_b[k])⟩ ∈ L(r_b), we first check if there exists a tuple ⟨i, pos(r_a[k])⟩ ∈ L(r_a); that is, both sequences must occur in the same database sequence s_i.

Next, for each position p ∈ pos(r_b[k]), we check whether there exists a position q ∈ pos(r_a[k]) such that q < p. If yes, this means that the symbol r_b[k] occurs after the last position of r_a, and thus we retain p as a valid occurrence of r_ab. The poslist L(r_ab) comprises all such valid occurrences.

We keep track of positions only for the last symbol in the candidate sequence, since we extend sequences from a common prefix, and so there is no need to keep track of all the occurrences of the symbols in the prefix.

We denote the sequential join as L(r_ab) = L(r_a) ∩ L(r_b).
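For example, joining the poslists of the 1-sequences A and G from the example database keeps only those G-positions that come after some A-position in the same sequence (a worked instance of the join, consistent with the poslist tables on the next slide):

L(A)  = {⟨1, {2,4,5}⟩, ⟨2, {3,5}⟩, ⟨3, {2,3}⟩}
L(G)  = {⟨1, {3,6}⟩, ⟨2, {2,6}⟩, ⟨3, {1,4}⟩}
L(AG) = L(A) ∩ L(G) = {⟨1, {3,6}⟩, ⟨2, {6}⟩, ⟨3, {4}⟩},  hence sup(AG) = 3

In s_2 = TGACAG, for instance, the G at position 2 is dropped because the first A occurs only at position 3.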

Spade Algorithm

// Initial Call: F ← ∅, k ← 0,
//               P ← {⟨s, L(s)⟩ | s ∈ Σ, sup(s) ≥ minsup}
Spade (P, minsup, F, k):
1  foreach r_a ∈ P do
2      F ← F ∪ {(r_a, sup(r_a))}
3      P_a ← ∅
4      foreach r_b ∈ P do
5          r_ab = r_a + r_b[k]
6          L(r_ab) = L(r_a) ∩ L(r_b)
7          if sup(r_ab) ≥ minsup then
8              P_a ← P_a ∪ {⟨r_ab, L(r_ab)⟩}
9      if P_a ≠ ∅ then Spade (P_a, minsup, F, k + 1)
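A compact Python rendering of this recursion (my own sketch built on the vertical representation above, not the book's code): temporal_join implements the sequential join L(r_a) ∩ L(r_b), and each recursive call processes one class of sequences sharing a common prefix.

from collections import defaultdict

def build_poslists(db):
    """symbol -> {sequence id -> sorted positions of that symbol} (1-based)."""
    pl = defaultdict(lambda: defaultdict(list))
    for i, seq in enumerate(db, start=1):
        for p, sym in enumerate(seq, start=1):
            pl[sym][i].append(p)
    return {sym: dict(d) for sym, d in pl.items()}

def temporal_join(La, Lb):
    """Keep positions of r_b's last symbol that occur after some position of
    r_a's last symbol, per common sequence id."""
    Lab = {}
    for i, pos_b in Lb.items():
        if i in La:
            first_a = min(La[i])
            valid = [p for p in pos_b if p > first_a]
            if valid:
                Lab[i] = valid
    return Lab

def spade(P, minsup, F):
    """P is a list of (sequence, poslist) pairs sharing a common prefix."""
    for ra, La in P:
        F[ra] = len(La)                    # support = number of distinct sequences
        Pa = []
        for rb, Lb in P:
            rab = ra + rb[-1]              # extend ra with the last symbol of rb
            Lab = temporal_join(La, Lb)
            if len(Lab) >= minsup:
                Pa.append((rab, Lab))
        if Pa:
            spade(Pa, minsup, F)
    return F

db = ["CAGAAGT", "TGACAG", "GAAGT"]
symbols = build_poslists(db)
P0 = [(s, L) for s, L in sorted(symbols.items()) if len(L) >= 3]
print(spade(P0, minsup=3, F={}))
# frequent: A, AA, AAG, AG, G, GA, GAA, GAAG, GAG, GG, T (support 3 each)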

Sequence Mining via Spade

Poslists for the example database (each tuple ⟨i, {positions}⟩ gives the sequence id and the positions of the last symbol):

A:  ⟨1,{2,4,5}⟩ ⟨2,{3,5}⟩ ⟨3,{2,3}⟩        C:  ⟨1,{1}⟩ ⟨2,{4}⟩
G:  ⟨1,{3,6}⟩ ⟨2,{2,6}⟩ ⟨3,{1,4}⟩          T:  ⟨1,{7}⟩ ⟨2,{1}⟩ ⟨3,{5}⟩

AA: ⟨1,{4,5}⟩ ⟨2,{5}⟩ ⟨3,{3}⟩              AG: ⟨1,{3,6}⟩ ⟨2,{6}⟩ ⟨3,{4}⟩
GA: ⟨1,{4,5}⟩ ⟨2,{3,5}⟩ ⟨3,{2,3}⟩          GG: ⟨1,{6}⟩ ⟨2,{6}⟩ ⟨3,{4}⟩
AT: ⟨1,{7}⟩ ⟨3,{5}⟩                         GT: ⟨1,{7}⟩ ⟨3,{5}⟩
TA: ⟨2,{3,5}⟩                               TG: ⟨2,{2,6}⟩

AAA: ⟨1,{5}⟩                                AAG: ⟨1,{6}⟩ ⟨2,{6}⟩ ⟨3,{4}⟩
AGA: ⟨1,{4,5}⟩                              AGG: ⟨1,{6}⟩
GAA: ⟨1,{5}⟩ ⟨2,{5}⟩ ⟨3,{3}⟩               GAG: ⟨1,{6}⟩ ⟨2,{6}⟩ ⟨3,{4}⟩

GAAG: ⟨1,{6}⟩ ⟨2,{6}⟩ ⟨3,{4}⟩
Projection-Based Sequence Mining: PrefixSpan

Let D denote a database, and let s ∈ Σ be any symbol. The projected database with respect to s, denoted D_s, is obtained as follows: for each sequence s_i ∈ D, find the first occurrence of s in s_i, say at position p; retain in D_s only the suffix of s_i starting at position p + 1; further, remove any infrequent symbols from that suffix.

PrefixSpan computes the support for only the individual symbols in the projected database D_s; it then performs recursive projections on the frequent symbols in a depth-first manner.

Given a frequent subsequence r, let D_r be the projected dataset for r. Initially r is empty and D_r is the entire input dataset D. Given a database of (projected) sequences D_r, PrefixSpan first finds all the frequent symbols in the projected dataset. For each such symbol s, we extend r by appending s to obtain the new frequent subsequence r_s. Next, we create the projected dataset D_s by projecting D_r on symbol s. A recursive call to PrefixSpan is then made with r_s and D_s.

PrefixSpan Algorithm

// Initial Call: D_r ← D, r ← ∅, F ← ∅
PrefixSpan (D_r, r, minsup, F):
1  foreach s ∈ Σ such that sup(s, D_r) ≥ minsup do
2      r_s = r + s // extend r by symbol s
3      F ← F ∪ {(r_s, sup(s, D_r))}
4      D_s ← ∅ // create projected data for symbol s
5      foreach s_i ∈ D_r do
6          s'_i ← projection of s_i w.r.t. symbol s
7          Remove any infrequent symbols from s'_i
8          Add s'_i to D_s if s'_i ≠ ∅
9      if D_s ≠ ∅ then PrefixSpan (D_s, r_s, minsup, F)
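A direct Python translation of this pseudocode (my own illustrative sketch): symbol supports are counted once per projected sequence, and projection keeps the suffix after the first occurrence of the chosen symbol, with infrequent symbols removed.

from collections import Counter

def prefixspan(D_r, r, minsup, F):
    """Depth-first, projection-based mining of frequent subsequences."""
    # count each symbol once per projected sequence in which it occurs
    sym_sup = Counter(sym for seq in D_r for sym in set(seq))
    for s, sup in sorted(sym_sup.items()):
        if sup < minsup:
            continue
        rs = r + s                              # extend the current prefix
        F[rs] = sup
        # project D_r on s: suffix after the first occurrence of s,
        # with infrequent symbols removed
        D_s = []
        for seq in D_r:
            p = seq.find(s)
            if p >= 0:
                suffix = "".join(c for c in seq[p + 1:] if sym_sup[c] >= minsup)
                if suffix:
                    D_s.append(suffix)
        if D_s:
            prefixspan(D_s, rs, minsup, F)
    return F

db = ["CAGAAGT", "TGACAG", "GAAGT"]
print(prefixspan(db, r="", minsup=3, F={}))
# frequent: A, AA, AAG, AG, G, GA, GAA, GAAG, GAG, GG, T (support 3 each)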

Projection-based Sequence Mining: PrefixSpan
D_∅ = {s_1: CAGAAGT, s_2: TGACAG, s_3: GAAGT};  symbol supports A(3), C(2), G(3), T(3)

D_A = {s_1: GAAGT, s_2: AG, s_3: AGT};  A(3), G(3), T(2)
D_G = {s_1: AAGT, s_2: AAG, s_3: AAGT};  A(3), G(3), T(2)
D_T = {s_2: GAAG};  A(1), G(1)

D_AA = {s_1: AG, s_2: G, s_3: G};  A(1), G(3)
D_AG = {s_1: AAG};  A(1), G(1)
D_GA = {s_1: AG, s_2: AG, s_3: AG};  A(3), G(3)
D_GG = ∅

D_AAG = ∅        D_GAA = {s_1: G, s_2: G, s_3: G};  G(3)        D_GAG = ∅

D_GAAG = ∅
Substring Mining via Suffix Trees

Let s be a sequence of length n; then there are at most O(n^2) distinct substrings contained in s. This is a much smaller search space than for subsequences, and consequently we can design more efficient algorithms for the frequent substring mining task.

Naively, we can mine all the frequent substrings in worst-case O(Nn^2) time for a dataset D = {s_1, s_2, …, s_N} with N sequences.

We will show that all frequent substrings can be mined in O(Nn) time via suffix trees.
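For reference, a naive miner (my own sketch) enumerates every substring of every sequence and counts, per substring, the number of distinct sequences containing it; this is the O(Nn^2) baseline that the suffix tree improves on.

from collections import defaultdict

def naive_frequent_substrings(db, minsup):
    """Naive O(N * n^2) frequent substring mining: enumerate all substrings
    of each sequence and count distinct sequences per substring."""
    containing = defaultdict(set)
    for i, seq in enumerate(db):
        n = len(seq)
        for a in range(n):
            for b in range(a + 1, n + 1):
                containing[seq[a:b]].add(i)
    return {sub: len(ids) for sub, ids in containing.items() if len(ids) >= minsup}

db = ["CAGAAGT", "TGACAG", "GAAGT"]
print(naive_frequent_substrings(db, minsup=3))
# {'A': 3, 'AG': 3, 'G': 3, 'GA': 3, 'T': 3}

Note that GAAG, which is frequent as a subsequence, is not a frequent substring: it occurs consecutively only in s_1 and s_3.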

Suffix Tree

Given a sequence s, we append a terminal character $ ∉ Σ, so that s = s_1 s_2 … s_n s_{n+1}, where s_{n+1} = $, and the jth suffix of s is given as s[j : n + 1] = s_j s_{j+1} … s_{n+1}.

The suffix tree of the sequences in the database D, denoted T, stores all the suffixes for each s_i ∈ D in a tree structure, where suffixes that share a common prefix lie on the same path from the root of the tree.

The substring obtained by concatenating all the symbols from the root node to a node v is called the node label of v, denoted L(v). The substring that appears on an edge (v_a, v_b) is called an edge label, denoted L(v_a, v_b).

A suffix tree has two kinds of nodes: internal nodes and leaves. An internal node (except for the root) has at least two children, where each edge label to a child begins with a different symbol. Because the terminal character is unique, there are as many leaves in the suffix tree as there are unique suffixes over all the sequences. Each leaf node corresponds to a suffix shared by one or more sequences in D.

Suffix Tree Construction for s = CAGAAGT $
Insert each suffix j per step
[Figure: step-by-step insertion of the suffixes j = 1, …, 4 of CAGAAGT$ into the tree; each leaf is labeled (sequence id, suffix start position), e.g., (1,1) for CAGAAGT$ and (1,2) for AGAAGT$. Panels: (a) j = 1, (b) j = 2, (c) j = 3, (d) j = 4.]

Suffix Tree Construction for s = CAGAAGT $
Insert each suffix j per step

[Figure: the construction continues with suffixes j = 5, 6, 7. Panels (e) j = 5, (f) j = 6, (g) j = 7 show the tree after each insertion, ending with the complete suffix tree for CAGAAGT$.]

Suffix Tree for Entire Database
D = {s 1 = CAGAAGT , s 2 = TGACAG , s 3 = GAAGT }

[Figure: the suffix tree for the whole database; every suffix of s_1, s_2, and s_3 ends at a leaf labeled (i, j), meaning the jth suffix of sequence s_i, and internal nodes are annotated with the number of distinct sequences below them.]

Frequent Substrings

Once the suffix tree is built, we can compute all the frequent substrings by
checking how many different sequences appear in a leaf node or under an internal
node.
The node labels for the nodes with support at least minsup yield the set of
frequent substrings; all the prefixes of such node labels are also frequent.
The suffix tree can also support ad hoc queries for finding all the occurrences in
the database for any query substring q. For each symbol in q, we follow the path
from the root until all symbols in q have been seen, or until there is a mismatch
at any position. If q is found, then the set of leaves under that path is the list of
occurrences of the query q. On the other hand, if there is a mismatch, then the query
does not occur in the database.
Because we have to match each character in q, we immediately get O(|q|) as the
time bound (assuming that |Σ| is a constant), which is independent of the size of
the database. Listing all the matches takes additional time, for a total time
complexity of O(|q| + k), if there are k matches.
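The counting and querying described above can be illustrated with a generalized suffix trie (a simplification of the suffix tree in which every edge carries a single symbol; my own sketch, not the book's structure). Each node records the set of sequence ids that reach it, so its support is the size of that set, and a query simply walks down from the root.

from collections import defaultdict

def build_suffix_trie(db):
    """Generalized suffix trie: node = (children dict, set of sequence ids).
    Every suffix of every sequence (with terminal '$') is inserted from the root."""
    root = ({}, set())
    for i, seq in enumerate(db, start=1):
        s = seq + "$"
        for j in range(len(s)):
            node = root
            node[1].add(i)
            for sym in s[j:]:
                node = node[0].setdefault(sym, ({}, set()))
                node[1].add(i)
    return root

def frequent_substrings(root, minsup, prefix=""):
    """Node labels (root-to-node strings) with support >= minsup."""
    out = {}
    for sym, child in root[0].items():
        if sym != "$" and len(child[1]) >= minsup:
            label = prefix + sym
            out[label] = len(child[1])
            out.update(frequent_substrings(child, minsup, label))
    return out

def occurrences(root, q):
    """Ad hoc query: which sequences contain substring q?"""
    node = root
    for sym in q:
        if sym not in node[0]:
            return set()
        node = node[0][sym]
    return node[1]

db = ["CAGAAGT", "TGACAG", "GAAGT"]
trie = build_suffix_trie(db)
print(frequent_substrings(trie, minsup=3))  # {'A': 3, 'AG': 3, 'G': 3, 'GA': 3, 'T': 3}
print(occurrences(trie, "GAAG"))            # {1, 3}

A proper suffix tree compresses unary paths into single edges, which is what the next slides use to bring space and construction time down to linear.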

Ukkonen’s Linear Time Suffix Tree Algorithm

Achieving Linear Space: If an algorithm stores all the symbols on each edge label, then the space complexity is O(n^2), and we cannot achieve linear-time construction either.

The trick is not to store the edge labels explicitly, but rather to use an edge-compression technique, where we store only the starting and ending positions of the edge label in the input string s. That is, if an edge label is given as s[i : j], then we represent it as the interval [i, j].
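A minimal sketch of the idea (illustrative only, not the book's data structure): each edge stores a pair of 1-based, inclusive indices into s instead of a copy of its label, so the label is materialized on demand.

s = "CAGAAGT$"

def edge_label(start, end):
    """Recover the label of an edge stored as the interval [start, end] (1-based, inclusive)."""
    return s[start - 1:end]

# two edges out of the root of the suffix tree for s, stored in compressed form
edges = {"C": (1, 8),   # label CAGAAGT$
         "T": (7, 8)}   # label T$
print(edge_label(*edges["C"]))   # CAGAAGT$
print(edge_label(*edges["T"]))   # T$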

Suffix Tree using Edge-compression: s = CAGAAGT $

[Figure: (a) the suffix tree for CAGAAGT$ with explicit edge labels, and (b) the same tree with edge-compressed labels, where each edge stores only the interval [i, j] of its label in s, for example [1, 8] for CAGAAGT$ and [7, 8] for T$. Internal nodes are marked v_1, …, v_4; leaves are labeled (1, j) for the jth suffix.]

Ukkonen Algorithm: Achieving Linear Time

Ukkonen's method is an online algorithm: given a string s = s_1 s_2 … s_n $, it constructs the full suffix tree in phases. Phase i builds the tree up to the ith symbol in s. Let T_i denote the suffix tree up to the ith prefix s[1 : i], with 1 ≤ i ≤ n. Ukkonen's algorithm constructs T_i from T_{i−1} by making sure that all suffixes, including the current character s_i, are in the new intermediate tree T_i.

In other words, in the ith phase, it inserts all the suffixes s[j : i], from j = 1 to j = i, into the tree T_i. Each such insertion is called the jth extension of the ith phase. Once we process the terminal character at position n + 1, we obtain the final suffix tree T for s.

However, this naive Ukkonen method has cubic time complexity, because obtaining T_i from T_{i−1} takes O(i^2) time, with the last phase requiring O(n^2) time. With n phases, the total time is O(n^3). We will show that this time can be reduced to O(n).

Algorithm NaiveUkkonen

NaiveUkkonen (s):
1  n ← |s|
2  s[n + 1] ← $ // append terminal character
3  T ← ∅ // add empty string as root
4  foreach i = 1, …, n + 1 do // phase i: construct T_i
5      foreach j = 1, …, i do // extension j for phase i
           // Insert s[j : i] into the suffix tree
6          Find the end of the path with label s[j : i − 1] in T
7          Insert s_i at the end of the path
8  return T

Ukkonen’s Linear Time Algorithm: Implicit Suffixes

This optimization states that, in phase i, if the jth extension s[j : i] is found in the
tree, then any subsequent extensions will also be found, and consequently there is
no need to process further extensions in phase i.

Thus, the suffix tree T_i at the end of phase i has implicit suffixes corresponding to extensions j + 1 through i.

It is important to note that all suffixes will become explicit the first time we encounter a new substring that does not already exist in the tree. This will surely happen in phase n + 1, when we process the terminal character $, as it cannot occur anywhere else in s (after all, $ ∉ Σ).

Ukkonen’s Algorithm: Implicit Extensions

Let the current phase be i, and let l ≤ i − 1 be the last explicit suffix in the previous tree T_{i−1}.

All explicit suffixes in T_{i−1} have edge labels of the form [x, i − 1] leading to the corresponding leaf nodes, where the starting position x is node specific, but the ending position must be i − 1, because s_{i−1} was added to the end of these paths in phase i − 1.

In the current phase i, we would have to extend these paths by adding s_i at the end. However, instead of explicitly incrementing all the ending positions, we can replace the ending position by a pointer e that keeps track of the current phase being processed.

If we replace [x, i − 1] with [x, e], then in phase i, by setting e = i, all l existing suffixes are immediately and implicitly extended to [x, i]. Thus, with the single operation of incrementing e we have, in effect, taken care of extensions 1 through l for phase i.
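A tiny illustration of this trick (my own sketch): leaf edges store a shared, mutable end marker instead of a fixed integer, so bumping the marker once extends every leaf edge at the same time.

class End:
    """Shared end pointer: all leaf edges reference this single object."""
    def __init__(self):
        self.value = 0

s = "CAGAAGT$"
e = End()
# leaf edges as (start, shared_end); e.g., the leaves for suffixes starting at 1 and 2
leaf_edges = [(1, e), (2, e)]

def label(edge):
    start, end = edge
    return s[start - 1:end.value]        # 1-based inclusive interval [start, e]

e.value = 6                              # phase i = 6: one increment extends all leaves
print([label(x) for x in leaf_edges])    # ['CAGAAG', 'AGAAG']
e.value = 7                              # phase i = 7
print([label(x) for x in leaf_edges])    # ['CAGAAGT', 'AGAAGT']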

Implicit Extensions: s = CAGAAGT $, Phase i = 7

[Figure: implicit extensions in phase i = 7. Panel (a) shows T_6, whose leaf edges carry labels such as [1, e] = CAGAAG and [5, e] = AG; panel (b) shows T_7 after extensions j = 1, …, 4, where setting e = 7 implicitly extends those labels to CAGAAGT, AGT, and so on.]

Ukkonen’s Algorithm: Skip/count Trick

For the jth extension of phase i, we have to search for the substring s[j : i − 1] so that we can add s_i at the end. Note that this string must exist in T_{i−1}, because we have already processed symbol s_{i−1} in the previous phase. Thus, instead of searching for each character in s[j : i − 1] starting from the root, we first count the number of symbols on the edge beginning with character s_j; let this length be m. If m is longer than the length of the substring (i.e., if m > i − j), then the substring must end on this edge, so we simply jump to position i − j and insert s_i.

On the other hand, if m ≤ i − j, then we can skip directly to the child node, say v_c, and search for the remaining string s[j + m : i − 1] from v_c using the same skip/count technique.

With this optimization, the cost of an extension becomes proportional to the number of nodes on the path, as opposed to the number of characters in s[j : i − 1].

Ukkonen’s Algorithm: Suffix Links

We can avoid searching for the substring s[j : i − 1] from the root via the use of suffix links. For each internal node v_a we maintain a link to the internal node v_b, where L(v_b) is the immediate suffix of L(v_a).

In extension j − 1, let v_p denote the internal node under which we find s[j − 1 : i], and let m be the length of the node label of v_p. To insert the jth extension s[j : i], we follow the suffix link from v_p to another node, say v_s, and search for the remaining substring s[j + m − 1 : i − 1] from v_s.

The use of suffix links allows us to jump internally within the tree for different extensions, as opposed to searching from the root each time.

Linear Time Ukkonen Algorithm

Ukkonen (s):
1  n ← |s|
2  s[n + 1] ← $ // append terminal character
3  T ← ∅ // add empty string as root
4  l ← 0 // last explicit suffix
5  foreach i = 1, …, n + 1 do // phase i: construct T_i
6      e ← i // implicit extensions
7      foreach j = l + 1, …, i do // extension j for phase i
           // Insert s[j : i] into the suffix tree
8          Find end of s[j : i − 1] in T via skip/count and suffix links
9          if s_i ∈ T then // implicit suffixes
10             break
11         else
12             Insert s_i at end of path
13             Set last explicit suffix l if needed
14 return T

Ukkonen’s Suffix Tree Construction: s = CAGAAGT $

[Figure: Ukkonen's construction of the suffix tree for s = CAGAAGT$. Panels (a)-(d) show T_1 through T_4 (phases e = 1, …, 4), with leaf edges stored as open-ended intervals [x, e].]

Ukkonen’s Suffix Tree Construction: s = CAGAAGT $

[Figure: the construction continues with panels (e) T_5, (f) T_6, and (g) T_7 (phases e = 5, 6, 7).]

Extensions in Phase i = 7

[Figure: the extensions performed in phase i = 7. Panel (a): extensions 1-4 are handled implicitly by incrementing e; panel (b): extension 5 inserts the suffix AGT, creating the internal node v_AG; panel (c): extension 6 inserts GT, creating the internal node v_G.]
