Chapter 10: Sequence Mining

Mohammed J. Zaki¹   Wagner Meira Jr.²

¹ Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY, USA
² Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
Sequence Mining: Terminology

Frequent Sequences
Mining Frequent Sequences
For sequence mining the order of the symbols matters, and thus we have to
consider all possible permutations of the symbols as the possible frequent
candidates. Contrast this with itemset mining, where we had only to consider
combinations of the items.
The sequence search space can be organized in a prefix search tree. The root of
the tree, at level 0, contains the empty sequence, with each symbol x ∈ Σ as one
of its children. As such, a node labeled with the sequence s = s1 s2 ... sk at level k
has children of the form s′ = s1 s2 ... sk sk+1 at level k + 1. In other words, s is a
prefix of each child s′, which is also called an extension of s.
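To make the search space concrete, here is a minimal Python sketch (not from the book; the alphabet is taken from the example database that follows) that enumerates the children of a prefix node and the full candidate set at a given level:

from itertools import product

SIGMA = ["A", "C", "G", "T"]   # alphabet of the example database (assumption)

def children(prefix, sigma=SIGMA):
    """Children of the node labeled `prefix`: all one-symbol extensions."""
    return [prefix + x for x in sigma]

def level(k, sigma=SIGMA):
    """All candidate k-sequences, i.e., the |sigma|**k nodes at depth k."""
    return ["".join(p) for p in product(sigma, repeat=k)]

print(children("AG"))   # ['AGA', 'AGC', 'AGG', 'AGT']
print(len(level(3)))    # 64 = 4**3

The number of k-sequence candidates thus grows as |Σ|^k, versus the C(|Σ|, k) itemsets of the same size in itemset mining.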
Example Sequence Database
Id Sequence
s1 CAGAAGT
s2 TGACAG
s3 GAAGT
Level-wise Sequence Mining: GSP Algorithm
The GSP algorithm searches the sequence prefix tree using a level-wise or breadth-first
search. Given the set of frequent sequences at level k, we generate all possible sequence
extensions or candidates at level k + 1. We next compute the support of each candidate
and prune those that are not frequent. The search stops when no more frequent
extensions are possible.
The prefix search tree at level k is denoted C^(k). Initially C^(1) comprises all the symbols
in Σ. Given the current set of candidate k-sequences C^(k), the method first computes
their support.
For each database sequence s_i ∈ D, we check whether a candidate sequence r ∈ C^(k) is a
subsequence of s_i. If so, we increment the support of r. Once the frequent sequences at
level k have been found, we generate the candidates for level k + 1.
For the extension, each leaf r_a is extended with the last symbol of any other leaf r_b that
shares the same prefix (i.e., has the same parent), to obtain the new candidate
(k + 1)-sequence r_ab = r_a + r_b[k]. If the new candidate r_ab contains any infrequent
k-sequence, we prune it.
Algorithm GSP
Algorithm ComputeSupport
ComputeSupport (C^(k), D):
1 foreach s_i ∈ D do
2     foreach r ∈ C^(k) do
3         if r ⊆ s_i then sup(r) ← sup(r) + 1

ExtendPrefixTree (C^(k)):
1 foreach leaf r_a ∈ C^(k) do
2     foreach leaf r_b ∈ Children(Parent(r_a)) do
3         r_ab ← r_a + r_b[k]  // extend r_a with last item of r_b
          // prune if there are any infrequent subsequences
4         if r_c ∈ C^(k), for all r_c ⊂ r_ab such that |r_c| = |r_ab| − 1, then
5             Add r_ab as child of r_a with sup(r_ab) ← 0
6     if no extensions from r_a then
7         remove r_a, and all ancestors of r_a with no extensions, from C^(k)
8 return C^(k)
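The following Python sketch is a hedged, flattened version of the same level-wise search (my own simplification: sequences are tuples, candidates are kept in a flat list rather than a prefix tree, and is_subseq/gsp are illustrative names):

from itertools import product

def is_subseq(r, s):
    """True if r occurs as a (not necessarily contiguous) subsequence of s."""
    it = iter(s)
    return all(sym in it for sym in r)

def gsp(db, sigma, minsup):
    """Minimal level-wise (GSP-style) sketch; returns {sequence: support}."""
    freq = {}
    cands = [(x,) for x in sigma]                      # C^(1)
    while cands:
        sup = {r: sum(is_subseq(r, s) for s in db) for r in cands}
        level = {r: c for r, c in sup.items() if c >= minsup}
        freq.update(level)
        cands = []                                     # C^(k+1)
        for ra, rb in product(level, repeat=2):
            if ra[:-1] == rb[:-1]:                     # siblings: same parent
                rab = ra + (rb[-1],)                   # r_ab = r_a + r_b[k]
                # prune if any k-subsequence of r_ab is infrequent
                if all(rab[:i] + rab[i+1:] in freq for i in range(len(rab))):
                    cands.append(rab)
    return freq

db = ["CAGAAGT", "TGACAG", "GAAGT"]                    # example database
print({"".join(r): c for r, c in gsp(db, "ACGT", 3).items()})   # includes 'GAAG': 3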
Sequence Search Space
[Figure: the sequence prefix search tree for the example database with minsup = 3; each node shows a sequence and its support, e.g., the root ∅(3) and, at level 3, AAA(1), AAG(3), AGA(1), AGG(1), GAA(3), GAG(3), GGA(0), GGG(0); shaded ovals mark infrequent sequences.]
Vertical Sequence Mining: Spade
The Spade algorithm uses a vertical database representation for sequence mining.
For each symbol s ∈ Σ, we keep a set of tuples of the form ⟨i, pos(s)⟩, where
pos(s) is the set of positions in the database sequence s_i ∈ D where symbol s
appears.
Let L(s) denote the set of such sequence-position tuples for symbol s, which we
refer to as the poslist. The set of poslists for each symbol s ∈ Σ thus constitutes a
vertical representation of the input database.
Given a k-sequence r, its poslist L(r) maintains the list of positions for the
occurrences of the last symbol r[k] in each database sequence s_i, provided r ⊆ s_i.
The support of sequence r is simply the number of distinct sequences in which r
occurs, that is, sup(r) = |L(r)|.
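As an illustration (not the book's code), the poslists for the example database can be built as follows; positions are 1-based to match the text:

from collections import defaultdict

def poslists(db):
    """Vertical representation: for each symbol, map sequence id -> sorted
    list of 1-based positions where the symbol occurs."""
    L = defaultdict(dict)
    for i, s in enumerate(db, start=1):
        for p, sym in enumerate(s, start=1):
            L[sym].setdefault(i, []).append(p)
    return L

def support(pl):
    """sup(r) = number of distinct sequences containing r."""
    return len(pl)

db = ["CAGAAGT", "TGACAG", "GAAGT"]
L = poslists(db)
print(L["A"])           # {1: [2, 4, 5], 2: [3, 5], 3: [2, 3]}
print(support(L["A"]))  # 3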
Spade Algorithm
// Initial Call: F ← ∅, k ← 0, P ← {⟨s, L(s)⟩ | s ∈ Σ, sup(s) ≥ minsup}
Spade (P, minsup, F, k):
1 foreach r_a ∈ P do
2     F ← F ∪ {(r_a, sup(r_a))}
3     P_a ← ∅
4     foreach r_b ∈ P do
5         r_ab ← r_a + r_b[k]
6         L(r_ab) ← L(r_a) ∩ L(r_b)
7         if sup(r_ab) ≥ minsup then
8             P_a ← P_a ∪ {⟨r_ab, L(r_ab)⟩}
9     if P_a ≠ ∅ then Spade (P_a, minsup, F, k + 1)
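Below is a minimal Python sketch of the recursion. The pseudocode writes line 6 as an intersection; the sketch spells it out as a temporal (sequential) join on the poslists, which is what makes r_ab = r_a + r_b[k] well defined. The names temporal_join and spade are mine, and poslists are plain dicts:

def temporal_join(la, lb):
    """Sequential join of poslists: for every sequence id present in both,
    keep the positions in lb that come strictly after the earliest position
    recorded in la (i.e., after some occurrence of r_a)."""
    out = {}
    for i, pa in la.items():
        if i in lb:
            pos = [p for p in lb[i] if p > min(pa)]
            if pos:
                out[i] = pos
    return out

def spade(P, minsup, F=None):
    """Recursive vertical mining sketch. P maps a sequence (tuple of symbols)
    to its poslist {sequence id: [positions of the last symbol]}."""
    if F is None:
        F = {}
    for ra, la in P.items():
        F[ra] = len(la)                  # sup(r_a) = |L(r_a)|
        Pa = {}
        for rb, lb in P.items():         # r_b ranges over r_a's class
            rab = ra + (rb[-1],)         # r_ab = r_a + r_b[k]
            lab = temporal_join(la, lb)
            if len(lab) >= minsup:
                Pa[rab] = lab
        if Pa:
            spade(Pa, minsup, F)
    return F

# frequent 1-sequences of the example database (poslists built inline)
db = ["CAGAAGT", "TGACAG", "GAAGT"]
P = {}
for i, seq in enumerate(db, start=1):
    for p, sym in enumerate(seq, start=1):
        P.setdefault((sym,), {}).setdefault(i, []).append(p)
P = {r: pl for r, pl in P.items() if len(pl) >= 3}
print({"".join(r): c for r, c in spade(P, 3).items()})   # includes 'GAAG': 3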
Sequence Mining via Spade
Poslists (sequence id → positions of the last symbol) for the example database, minsup = 3:
∅
A: 1→{2,4,5}, 2→{3,5}, 3→{2,3}   C: 1→{1}, 2→{4}   G: 1→{3,6}, 2→{2,6}, 3→{1,4}   T: 1→{7}, 2→{1}, 3→{5}
AA: 1→{4,5}, 2→{5}, 3→{3}   AG: 1→{3,6}, 2→{6}, 3→{4}   AT: 1→{7}, 3→{5}   GA: 1→{4,5}, 2→{3,5}, 3→{2,3}   GG: 1→{6}, 2→{6}, 3→{4}   GT: 1→{7}, 3→{5}   TA: 2→{3,5}   TG: 2→{2,6}
GAAG: 1→{6}, 2→{6}, 3→{4}
Projection-Based Sequence Mining: PrefixSpan
Let D denote a database, and let s ∈ Σ be any symbol. The projected database
with respect to s, denoted D_s, is obtained by finding the first occurrence of s in
s_i, say at position p. Next, we retain in D_s only the suffix of s_i starting at
position p + 1. Further, any infrequent symbols are removed from the suffix. This
is done for each sequence s_i ∈ D.
PrefixSpan computes the support for only the individual symbols in the projected
database D_s; it then performs recursive projections on the frequent symbols in a
depth-first manner.
Given a frequent subsequence r, let D_r be the projected dataset for r. Initially r
is empty and D_r is the entire input dataset D. Given a database of (projected)
sequences D_r, PrefixSpan first finds all the frequent symbols in the projected
dataset. For each such symbol s, we extend r by appending s to obtain the new
frequent subsequence r_s. Next, we create the projected dataset D_s by projecting
D_r on symbol s. A recursive call to PrefixSpan is then made with r_s and D_s.
PrefixSpan Algorithm
// Initial Call: D_r ← D, r ← ∅, F ← ∅
PrefixSpan (D_r, r, minsup, F):
1 foreach s ∈ Σ such that sup(s, D_r) ≥ minsup do
2     r_s ← r + s  // extend r by symbol s
3     F ← F ∪ {(r_s, sup(s, D_r))}
4     D_s ← ∅  // create projected data for symbol s
5     foreach s_i ∈ D_r do
6         s′_i ← projection of s_i w.r.t. symbol s
7         Remove any infrequent symbols from s′_i
8         Add s′_i to D_s if s′_i ≠ ∅
9     if D_s ≠ ∅ then PrefixSpan (D_s, r_s, minsup, F)
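A compact Python sketch of the recursion (illustrative only; it skips the physical removal of infrequent symbols from the suffixes, which affects pruning but not the result):

from collections import Counter

def project(db, s):
    """Project each sequence on symbol s: keep the suffix after the first
    occurrence of s (drop the sequence if s does not occur or the suffix is empty)."""
    out = []
    for seq in db:
        p = seq.find(s)
        if p != -1 and seq[p + 1:]:
            out.append(seq[p + 1:])
    return out

def prefixspan(db, minsup, prefix="", F=None):
    """Minimal PrefixSpan sketch: grow frequent prefixes depth-first by
    repeatedly projecting the (already projected) database."""
    if F is None:
        F = {}
    sup = Counter()
    for seq in db:                       # per-sequence symbol supports
        sup.update(set(seq))
    for s, c in sup.items():
        if c >= minsup:
            F[prefix + s] = c
            Ds = project(db, s)
            if Ds:
                prefixspan(Ds, minsup, prefix + s, F)
    return F

db = ["CAGAAGT", "TGACAG", "GAAGT"]
print(prefixspan(db, 3))   # includes 'GAAG': 3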
Projection-based Sequence Mining: PrefixSpan
Projected databases and symbol supports for the example database, minsup = 3:
D_∅ : s1 = CAGAAGT, s2 = TGACAG, s3 = GAAGT   [A(3), C(2), G(3), T(3)]
D_A : s1 = GAAGT, s2 = AG, s3 = AGT   [A(3), G(3), T(2)]
D_G : s1 = AAGT, s2 = AAG, s3 = AAGT   [A(3), G(3), T(2)]
D_T : s2 = GAAG   [A(1), G(1)]
D_AA : s1 = AG, s2 = G, s3 = G   [A(1), G(3)]
D_AG : s1 = AAG   [A(1), G(1)]
D_GA : s1 = AG, s2 = AG, s3 = AG   [A(3), G(3)]
D_GG : ∅
D_AAG : ∅
D_GAA : s1 = G, s2 = G, s3 = G   [G(3)]
D_GAG : ∅
D_GAAG : ∅
Substring Mining via Suffix Trees
Let s be a sequence of length n; then there are at most O(n²) distinct substrings
contained in s. This is a much smaller search space compared to subsequences,
and consequently we can design more efficient algorithms for solving the frequent
substring mining task.
Naively, we can mine all the frequent substrings in worst-case O(Nn²) time for a
dataset D = {s_1, s_2, ..., s_N} with N sequences.
We will show that all frequent substrings can be mined in O(Nn) time via suffix trees.
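For reference, a brute-force baseline in Python that realizes the naive O(Nn²) bound by enumerating every substring of every sequence (not the suffix-tree method developed next):

from collections import defaultdict

def frequent_substrings(db, minsup):
    """Enumerate every substring of every sequence and count the number of
    distinct sequences containing it (worst-case O(N n^2) substrings)."""
    occ = defaultdict(set)
    for i, s in enumerate(db):
        for a in range(len(s)):
            for b in range(a + 1, len(s) + 1):
                occ[s[a:b]].add(i)
    return {w: len(ids) for w, ids in occ.items() if len(ids) >= minsup}

db = ["CAGAAGT", "TGACAG", "GAAGT"]
print(frequent_substrings(db, 3))   # {'A': 3, 'G': 3, 'T': 3, 'AG': 3, 'GA': 3}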
Suffix Tree
Suffix Tree Construction for s = CAGAAGT$
Insert each suffix j per step.
[Figure: the partial suffix trees after inserting suffixes j = 1 through 4; each leaf label (1, j) records the sequence id and the starting position of the inserted suffix.]
Suffix Tree Construction for s = CAGAAGT$ (contd.)
[Figure: the partial suffix trees after inserting the remaining suffixes, yielding the full suffix tree for CAGAAGT$.]
Suffix Tree for Entire Database
D = {s_1 = CAGAAGT, s_2 = TGACAG, s_3 = GAAGT}
[Figure: the generalized suffix tree over all three sequences; each leaf label (i, j) records the sequence id i and the starting position j of the corresponding suffix.]
Frequent Substrings
Once the suffix tree is built, we can compute all the frequent substrings by
checking how many distinct sequences appear in the leaves at or under each node.
The node labels for the nodes with support at least minsup yield the set of
frequent substrings; all the prefixes of such node labels are also frequent.
The suffix tree can also support ad hoc queries for finding all the occurrences of
any query substring q in the database. Starting at the root, we follow the path
that matches the symbols of q until either all symbols of q have been matched or
a mismatch occurs. If q is found, the set of leaves under the end of that path
gives all occurrences of q. If there is a mismatch, the query does not occur in
the database.
Because we have to match each character in q, we immediately get O(|q|) as the
time bound (assuming that |Σ| is a constant), which is independent of the size of
the database. Listing all the matches takes additional time, for a total time
complexity of O(|q| + k), if there are k matches.
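The sketch below uses an uncompressed suffix trie (one symbol per edge) as a simplified stand-in for the generalized suffix tree, so it is quadratic in space, but it makes the occurrence query and the support computation easy to see; all names are illustrative:

def build_suffix_trie(db):
    """Uncompressed suffix trie over all sequences: each node maps symbols to
    children and records the (sequence id, suffix start) pairs passing through it."""
    root = {"children": {}, "occ": set()}
    for i, s in enumerate(db, start=1):
        s = s + "$"
        for j in range(len(s)):
            node = root
            for sym in s[j:]:
                node = node["children"].setdefault(sym, {"children": {}, "occ": set()})
                node["occ"].add((i, j + 1))            # 1-based start position
    return root

def occurrences(root, q):
    """All (sequence id, start position) occurrences of substring q."""
    node = root
    for sym in q:
        if sym not in node["children"]:
            return set()                               # mismatch: q does not occur
        node = node["children"][sym]
    return node["occ"]

def support(root, q):
    """Number of distinct sequences containing q."""
    return len({i for i, _ in occurrences(root, q)})

db = ["CAGAAGT", "TGACAG", "GAAGT"]
root = build_suffix_trie(db)
print(occurrences(root, "GAAG"))   # {(1, 3), (3, 1)} (order may vary)
print(support(root, "AG"))         # 3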
Ukkonen’s Linear Time Suffix Tree Algorithm
Achieving Linear Space: If an algorithm stores all the symbols on each edge
label, then the space complexity is O(n²), and we cannot achieve linear time
construction either.
The trick is to not explicitly store all the edge labels, but rather to use an
edge-compression technique, where we store only the starting and ending positions
of the edge label in the input string s. That is, if an edge label is given as s[i : j],
then we represent it as the interval [i, j].
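A tiny sketch of edge compression, assuming 1-based inclusive intervals as in the text:

s = "CAGAAGT$"

# An edge label s[i : j] is stored as the interval [i, j] rather than as an
# explicit string, so each edge takes O(1) space regardless of label length.
class Edge:
    def __init__(self, start, end):
        self.start, self.end = start, end      # positions into s

    def label(self, s):
        return s[self.start - 1 : self.end]    # recover the label on demand

e = Edge(3, 6)
print(e.label(s))   # 'GAAG', i.e., the substring s[3 : 6]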
Suffix Tree using Edge-compression: s = CAGAAGT$
[Figure: the suffix tree for s = CAGAAGT$ with edge-compressed labels; each edge stores an interval such as [1, 8] = CAGAAGT$ or [3, 3] = G instead of the explicit substring, and leaves carry labels (1, j) as before.]
Algorithm NaiveUkkonen
NaiveUkkonen (s):
1 n ← |s|
2 s[n + 1] ← $  // append terminal character
3 T ← ∅  // add empty string as root
4 foreach i = 1, ..., n + 1 do  // phase i: construct T_i
5     foreach j = 1, ..., i do  // extension j for phase i
          // Insert s[j : i] into the suffix tree
6         Find end of the path with label s[j : i − 1] in T
7         Insert s_i at end of path
8 return T
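The following sketch mirrors the two nested loops of NaiveUkkonen, but inserts into an uncompressed trie of nested dicts rather than an edge-compressed suffix tree (a simplification to keep it short):

def naive_phases(s):
    """Phase i builds T_i by inserting every extension s[j:i], j = 1..i, exactly
    as in the pseudocode; the tree here is an uncompressed trie of nested dicts."""
    s = s + "$"                          # append terminal character
    root = {}
    for i in range(1, len(s) + 1):       # phase i: construct T_i
        for j in range(1, i + 1):        # extension j: insert s[j:i]
            node = root
            for sym in s[j - 1:i]:       # walk/create the path for s[j:i]
                node = node.setdefault(sym, {})
    return root

tree = naive_phases("CAGAAGT")
print(sorted(tree))   # ['$', 'A', 'C', 'G', 'T'], the children of the root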
Ukkonen’s Linear Time Algorithm: Implicit Suffixes
This optimization states that, in phase i, if the jth extension s[j : i] is found in the
tree, then any subsequent extensions will also be found, and consequently there is
no need to process further extensions in phase i.
Thus, the suffix tree T_i at the end of phase i has implicit suffixes corresponding
to extensions j + 1 through i.
It is important to note that all suffixes will become explicit the first time we
encounter a new substring that does not already exist in the tree. This will surely
happen in phase n + 1 when we process the terminal character $, as it cannot
occur anywhere else in s (after all, $ ∉ Σ).
Ukkonen’s Algorithm: Implicit Extensions
Let the current phase be i, and let l ≤ i − 1 be the last explicit suffix in the
previous tree T_{i−1}.
All explicit suffixes in T_{i−1} have edge labels of the form [x, i − 1] leading to the
corresponding leaf nodes, where the starting position x is node specific, but the
ending position must be i − 1 because s_{i−1} was added to the end of these paths in
phase i − 1.
In the current phase i, we would have to extend these paths by adding s_i at the
end. However, instead of explicitly incrementing all the ending positions, we can
replace the ending position by a pointer e that keeps track of the current phase
being processed.
If we replace [x, i − 1] with [x, e], then in phase i, by setting e = i, all the l existing
suffixes are immediately and implicitly extended to [x, i]. Thus, with the single
operation of incrementing e we have, in effect, taken care of extensions 1 through l
for phase i.
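A minimal sketch of the global-end trick, assuming leaf edges hold a shared, mutable end pointer (the class names End and LeafEdge are illustrative):

# Leaf edges store a shared, mutable end pointer e instead of a fixed ending
# position, so incrementing e once extends every leaf edge label at the same time.
class End:
    def __init__(self):
        self.value = 0

class LeafEdge:
    def __init__(self, start, end):
        self.start, self.end = start, end      # end is a shared End object

    def label(self, s):
        return s[self.start - 1 : self.end.value]

s = "CAGAAGT$"
e = End()
leaves = [LeafEdge(1, e), LeafEdge(3, e)]
e.value = 6                                    # phase i = 6 ...
print([leaf.label(s) for leaf in leaves])      # ['CAGAAG', 'GAAG']
e.value = 7                                    # ... one increment extends all leaves
print([leaf.label(s) for leaf in leaves])      # ['CAGAAGT', 'GAAGT']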
Implicit Extensions: s = CAGAAGT$, Phase i = 7
[Figure: implicit extensions in phase i = 7; setting e = 7 extends leaf edge labels such as [1, e] = CAGAAG to CAGAAGT and [5, e] = AG to AGT without visiting each leaf.]
Ukkonen’s Algorithm: Skip/count Trick
For the jth extension of phase i, we have to search for the substring s[j : i − 1] so
that we can add s_i at the end.
Note that this string must exist in T_{i−1} because we have already processed symbol
s_{i−1} in the previous phase. Thus, instead of searching for each character in
s[j : i − 1] starting from the root, we first count the number of symbols on the
edge beginning with character s_j; let this length be m. If m is longer than the
length of the substring (i.e., if m > i − j), then the substring must end on this
edge, so we simply jump to position i − j and insert s_i.
On the other hand, if m ≤ i − j, then we can skip directly to the child node, say
v_c, and search for the remaining string s[j + m : i − 1] from v_c using the same
skip/count technique.
With this optimization, the cost of an extension becomes proportional to the
number of nodes on the path, as opposed to the number of characters in
s[j : i − 1].
Ukkonen’s Algorithm: Suffix Links
We can avoid searching for the substring s[j : i − 1] from the root via the use of
suffix links.
For each internal node v_a we maintain a link to the internal node v_b, where L(v_b)
is the immediate suffix of L(v_a).
In extension j − 1, let v_p denote the internal node under which we find s[j − 1 : i],
and let m be the length of the node label of v_p. To insert the jth extension s[j : i],
we follow the suffix link from v_p to another node, say v_s, and search for the
remaining substring s[j + m − 1 : i − 1] from v_s.
The use of suffix links allows us to jump internally within the tree for different
extensions, as opposed to searching from the root each time.
Linear Time Ukkonen Algorithm
Ukkonen (s):
1 n ← |s|
2 s[n + 1] ← $  // append terminal character
3 T ← ∅  // add empty string as root
4 l ← 0  // last explicit suffix
5 foreach i = 1, ..., n + 1 do  // phase i: construct T_i
6     e ← i  // implicit extensions
7     foreach j = l + 1, ..., i do  // extension j for phase i
          // Insert s[j : i] into the suffix tree
8         Find end of s[j : i − 1] in T via skip/count and suffix links
9         if s_i ∈ T then  // implicit suffixes
10            break
11        else
12            Insert s_i at end of path
13            Set last explicit suffix l if needed
14 return T
Ukkonen's Suffix Tree Construction: s = CAGAAGT$
[Figure: the implicit suffix trees built during the first few phases, with edge labels stored as intervals such as [1, e] = CAG and [2, e] = AG; leaf labels (1, j) record the suffix start positions.]
Ukkonen's Suffix Tree Construction: s = CAGAAGT$ (contd.)
[Figure: the implicit suffix trees T5, T6, and T7, whose leaf edges carry interval labels of the form [x, e] that grow implicitly as e advances.]
Extensions in Phase i = 7
[Figure: the explicit extensions of phase i = 7 (symbol T), shown in panels (a)-(c); edges are split, new internal nodes v_A, v_AG, and v_G appear, and new leaf edges such as [7, e] = T are added.]
Data Mining and Machine Learning:
Fundamental Concepts and Algorithms
dataminingbook.info