Chapter 10: Sequence Mining

Mohammed J. Zaki¹   Wagner Meira Jr.²

¹ Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY, USA
² Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
Sequence Mining: Terminology

Frequent Sequences
Mining Frequent Sequences
For sequence mining the order of the symbols matters, and thus we have to
consider all possible permutations of the symbols as the possible frequent
candidates. Contrast this with itemset mining, where we had only to consider
combinations of the items.
The sequence search space can be organized in a prefix search tree. The root of
the tree, at level 0, contains the empty sequence, with each symbol x ∈ Σ as one
of its children. As such, a node labeled with the sequence s = s1 s2 ... sk at level k
has children of the form s′ = s1 s2 ... sk sk+1 at level k + 1. In other words, s is a
prefix of each child s′, which is also called an extension of s.
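To make the search space concrete, here is a minimal Python sketch (not from the book; the alphabet is taken from the example database that follows) that enumerates the children of a prefix node and the full candidate set at a given level:

from itertools import product

SIGMA = ["A", "C", "G", "T"]   # alphabet of the example database (assumption)

def children(prefix, sigma=SIGMA):
    """Children of the node labeled `prefix`: all one-symbol extensions."""
    return [prefix + x for x in sigma]

def level(k, sigma=SIGMA):
    """All candidate k-sequences, i.e., the |sigma|**k nodes at depth k."""
    return ["".join(p) for p in product(sigma, repeat=k)]

print(children("AG"))   # ['AGA', 'AGC', 'AGG', 'AGT']
print(len(level(3)))    # 64 = 4**3

The number of k-sequence candidates thus grows as |Σ|^k, versus the C(|Σ|, k) itemsets of the same size in itemset mining.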
Example Sequence Database
Id Sequence
s1 CAGAAGT
s2 TGACAG
s3 GAAGT
Level-wise Sequence Mining: GSP Algorithm
The GSP algorithm searches the sequence prefix tree using a level-wise or breadth-first
search. Given the set of frequent sequences at level k, we generate all possible sequence
extensions or candidates at level k + 1. We next compute the support of each candidate
and prune those that are not frequent. The search stops when no more frequent
extensions are possible.
The prefix search tree at level k is denoted C^(k). Initially C^(1) comprises all the symbols
in Σ. Given the current set of candidate k-sequences C^(k), the method first computes
their support.
For each database sequence s_i ∈ D, we check whether a candidate sequence r ∈ C^(k) is a
subsequence of s_i. If so, we increment the support of r. Once the frequent sequences at
level k have been found, we generate the candidates for level k + 1.
For the extension, each leaf r_a is extended with the last symbol of any other leaf r_b that
shares the same prefix (i.e., has the same parent), to obtain the new candidate
(k + 1)-sequence r_ab = r_a + r_b[k]. If the new candidate r_ab contains any infrequent
k-sequence, we prune it.
Algorithm GSP
Algorithm ComputeSupport
ComputeSupport (C^(k), D):
1 foreach s_i ∈ D do
2     foreach r ∈ C^(k) do
3         if r ⊆ s_i then sup(r) ← sup(r) + 1

ExtendPrefixTree (C^(k)):
1 foreach leaf r_a ∈ C^(k) do
2     foreach leaf r_b ∈ Children(Parent(r_a)) do
3         r_ab ← r_a + r_b[k]  // extend r_a with last item of r_b
          // prune if there are any infrequent subsequences
4         if r_c ∈ C^(k), for all r_c ⊂ r_ab such that |r_c| = |r_ab| − 1, then
5             Add r_ab as child of r_a with sup(r_ab) ← 0
6     if no extensions from r_a then
7         remove r_a, and all ancestors of r_a with no extensions, from C^(k)
8 return C^(k)
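The following Python sketch is a hedged, flattened version of the same level-wise search (my own simplification: sequences are tuples, candidates are kept in a flat list rather than a prefix tree, and is_subseq/gsp are illustrative names):

from itertools import product

def is_subseq(r, s):
    """True if r occurs as a (not necessarily contiguous) subsequence of s."""
    it = iter(s)
    return all(sym in it for sym in r)

def gsp(db, sigma, minsup):
    """Minimal level-wise (GSP-style) sketch; returns {sequence: support}."""
    freq = {}
    cands = [(x,) for x in sigma]                      # C^(1)
    while cands:
        sup = {r: sum(is_subseq(r, s) for s in db) for r in cands}
        level = {r: c for r, c in sup.items() if c >= minsup}
        freq.update(level)
        cands = []                                     # C^(k+1)
        for ra, rb in product(level, repeat=2):
            if ra[:-1] == rb[:-1]:                     # siblings: same parent
                rab = ra + (rb[-1],)                   # r_ab = r_a + r_b[k]
                # prune if any k-subsequence of r_ab is infrequent
                if all(rab[:i] + rab[i+1:] in freq for i in range(len(rab))):
                    cands.append(rab)
    return freq

db = ["CAGAAGT", "TGACAG", "GAAGT"]                    # example database
print({"".join(r): c for r, c in gsp(db, "ACGT", 3).items()})   # includes 'GAAG': 3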
Sequence Search Space
[Figure: the sequence prefix search tree for the example database with minsup = 3; each node shows a sequence and its support, e.g., the root ∅(3) and, at level 3, AAA(1), AAG(3), AGA(1), AGG(1), GAA(3), GAG(3), GGA(0), GGG(0); shaded ovals mark infrequent sequences.]
Vertical Sequence Mining: Spade
The Spade algorithm uses a vertical database representation for sequence mining.
For each symbol s ∈ Σ, we keep a set of tuples of the form ⟨i, pos(s)⟩, where
pos(s) is the set of positions in the database sequence s_i ∈ D where symbol s
appears.
Let L(s) denote the set of such sequence-position tuples for symbol s, which we
refer to as the poslist. The set of poslists for each symbol s ∈ Σ thus constitutes a
vertical representation of the input database.
Given a k-sequence r, its poslist L(r) maintains the list of positions for the
occurrences of the last symbol r[k] in each database sequence s_i, provided r ⊆ s_i.
The support of sequence r is simply the number of distinct sequences in which r
occurs, that is, sup(r) = |L(r)|.
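As an illustration (not the book's code), the poslists for the example database can be built as follows; positions are 1-based to match the text:

from collections import defaultdict

def poslists(db):
    """Vertical representation: for each symbol, map sequence id -> sorted
    list of 1-based positions where the symbol occurs."""
    L = defaultdict(dict)
    for i, s in enumerate(db, start=1):
        for p, sym in enumerate(s, start=1):
            L[sym].setdefault(i, []).append(p)
    return L

def support(pl):
    """sup(r) = number of distinct sequences containing r."""
    return len(pl)

db = ["CAGAAGT", "TGACAG", "GAAGT"]
L = poslists(db)
print(L["A"])           # {1: [2, 4, 5], 2: [3, 5], 3: [2, 3]}
print(support(L["A"]))  # 3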
Spade Algorithm
// Initial Call: F ← ∅, k ← 0, P ← {⟨s, L(s)⟩ | s ∈ Σ, sup(s) ≥ minsup}
Spade (P, minsup, F, k):
1 foreach r_a ∈ P do
2     F ← F ∪ {(r_a, sup(r_a))}
3     P_a ← ∅
4     foreach r_b ∈ P do
5         r_ab ← r_a + r_b[k]
6         L(r_ab) ← L(r_a) ∩ L(r_b)
7         if sup(r_ab) ≥ minsup then
8             P_a ← P_a ∪ {⟨r_ab, L(r_ab)⟩}
9     if P_a ≠ ∅ then Spade (P_a, minsup, F, k + 1)
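Below is a minimal Python sketch of the recursion. The pseudocode writes line 6 as an intersection; the sketch spells it out as a temporal (sequential) join on the poslists, which is what makes r_ab = r_a + r_b[k] well defined. The names temporal_join and spade are mine, and poslists are plain dicts:

def temporal_join(la, lb):
    """Sequential join of poslists: for every sequence id present in both,
    keep the positions in lb that come strictly after the earliest position
    recorded in la (i.e., after some occurrence of r_a)."""
    out = {}
    for i, pa in la.items():
        if i in lb:
            pos = [p for p in lb[i] if p > min(pa)]
            if pos:
                out[i] = pos
    return out

def spade(P, minsup, F=None):
    """Recursive vertical mining sketch. P maps a sequence (tuple of symbols)
    to its poslist {sequence id: [positions of the last symbol]}."""
    if F is None:
        F = {}
    for ra, la in P.items():
        F[ra] = len(la)                  # sup(r_a) = |L(r_a)|
        Pa = {}
        for rb, lb in P.items():         # r_b ranges over r_a's class
            rab = ra + (rb[-1],)         # r_ab = r_a + r_b[k]
            lab = temporal_join(la, lb)
            if len(lab) >= minsup:
                Pa[rab] = lab
        if Pa:
            spade(Pa, minsup, F)
    return F

# frequent 1-sequences of the example database (poslists built inline)
db = ["CAGAAGT", "TGACAG", "GAAGT"]
P = {}
for i, seq in enumerate(db, start=1):
    for p, sym in enumerate(seq, start=1):
        P.setdefault((sym,), {}).setdefault(i, []).append(p)
P = {r: pl for r, pl in P.items() if len(pl) >= 3}
print({"".join(r): c for r, c in spade(P, 3).items()})   # includes 'GAAG': 3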
Sequence Mining via Spade
Poslists (sequence id → positions of the last symbol) for the example database, minsup = 3:
∅
A: 1→{2,4,5}, 2→{3,5}, 3→{2,3}   C: 1→{1}, 2→{4}   G: 1→{3,6}, 2→{2,6}, 3→{1,4}   T: 1→{7}, 2→{1}, 3→{5}
AA: 1→{4,5}, 2→{5}, 3→{3}   AG: 1→{3,6}, 2→{6}, 3→{4}   AT: 1→{7}, 3→{5}   GA: 1→{4,5}, 2→{3,5}, 3→{2,3}   GG: 1→{6}, 2→{6}, 3→{4}   GT: 1→{7}, 3→{5}   TA: 2→{3,5}   TG: 2→{2,6}
GAAG: 1→{6}, 2→{6}, 3→{4}
Projection-Based Sequence Mining: PrefixSpan
Let D denote a database, and let s ∈ Σ be any symbol. The projected database
with respect to s, denoted D_s, is obtained by finding the first occurrence of s in
s_i, say at position p. Next, we retain in D_s only the suffix of s_i starting at
position p + 1. Further, any infrequent symbols are removed from the suffix. This
is done for each sequence s_i ∈ D.
PrefixSpan computes the support for only the individual symbols in the projected
database D_s; it then performs recursive projections on the frequent symbols in a
depth-first manner.
Given a frequent subsequence r, let D_r be the projected dataset for r. Initially r
is empty and D_r is the entire input dataset D. Given a database of (projected)
sequences D_r, PrefixSpan first finds all the frequent symbols in the projected
dataset. For each such symbol s, we extend r by appending s to obtain the new
frequent subsequence r_s. Next, we create the projected dataset D_s by projecting
D_r on symbol s. A recursive call to PrefixSpan is then made with r_s and D_s.
PrefixSpan Algorithm
// Initial Call: D_r ← D, r ← ∅, F ← ∅
PrefixSpan (D_r, r, minsup, F):
1 foreach s ∈ Σ such that sup(s, D_r) ≥ minsup do
2     r_s ← r + s  // extend r by symbol s
3     F ← F ∪ {(r_s, sup(s, D_r))}
4     D_s ← ∅  // create projected data for symbol s
5     foreach s_i ∈ D_r do
6         s′_i ← projection of s_i w.r.t. symbol s
7         Remove any infrequent symbols from s′_i
8         Add s′_i to D_s if s′_i ≠ ∅
9     if D_s ≠ ∅ then PrefixSpan (D_s, r_s, minsup, F)
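A compact Python sketch of the recursion (illustrative only; it skips the physical removal of infrequent symbols from the suffixes, which affects pruning but not the result):

from collections import Counter

def project(db, s):
    """Project each sequence on symbol s: keep the suffix after the first
    occurrence of s (drop the sequence if s does not occur or the suffix is empty)."""
    out = []
    for seq in db:
        p = seq.find(s)
        if p != -1 and seq[p + 1:]:
            out.append(seq[p + 1:])
    return out

def prefixspan(db, minsup, prefix="", F=None):
    """Minimal PrefixSpan sketch: grow frequent prefixes depth-first by
    repeatedly projecting the (already projected) database."""
    if F is None:
        F = {}
    sup = Counter()
    for seq in db:                       # per-sequence symbol supports
        sup.update(set(seq))
    for s, c in sup.items():
        if c >= minsup:
            F[prefix + s] = c
            Ds = project(db, s)
            if Ds:
                prefixspan(Ds, minsup, prefix + s, F)
    return F

db = ["CAGAAGT", "TGACAG", "GAAGT"]
print(prefixspan(db, 3))   # includes 'GAAG': 3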
Projection-based Sequence Mining: PrefixSpan
Projected databases and symbol supports for the example database, minsup = 3:
D_∅ : s1 = CAGAAGT, s2 = TGACAG, s3 = GAAGT   [A(3), C(2), G(3), T(3)]
D_A : s1 = GAAGT, s2 = AG, s3 = AGT   [A(3), G(3), T(2)]
D_G : s1 = AAGT, s2 = AAG, s3 = AAGT   [A(3), G(3), T(2)]
D_T : s2 = GAAG   [A(1), G(1)]
D_AA : s1 = AG, s2 = G, s3 = G   [A(1), G(3)]
D_AG : s1 = AAG   [A(1), G(1)]
D_GA : s1 = AG, s2 = AG, s3 = AG   [A(3), G(3)]
D_GG : ∅
D_AAG : ∅
D_GAA : s1 = G, s2 = G, s3 = G   [G(3)]
D_GAG : ∅
D_GAAG : ∅
Substring Mining via Suffix Trees
Let s be a sequence of length n; then there are at most O(n²) distinct substrings
contained in s. This is a much smaller search space compared to subsequences,
and consequently we can design more efficient algorithms for solving the frequent
substring mining task.
Naively, we can mine all the frequent substrings in worst-case O(Nn²) time for a
dataset D = {s_1, s_2, ..., s_N} with N sequences.
We will show that all frequent substrings can be mined in O(Nn) time via suffix trees.
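For reference, a brute-force baseline in Python that realizes the naive O(Nn²) bound by enumerating every substring of every sequence (not the suffix-tree method developed next):

from collections import defaultdict

def frequent_substrings(db, minsup):
    """Enumerate every substring of every sequence and count the number of
    distinct sequences containing it (worst-case O(N n^2) substrings)."""
    occ = defaultdict(set)
    for i, s in enumerate(db):
        for a in range(len(s)):
            for b in range(a + 1, len(s) + 1):
                occ[s[a:b]].add(i)
    return {w: len(ids) for w, ids in occ.items() if len(ids) >= minsup}

db = ["CAGAAGT", "TGACAG", "GAAGT"]
print(frequent_substrings(db, 3))   # {'A': 3, 'G': 3, 'T': 3, 'AG': 3, 'GA': 3}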
Suffix Tree
Suffix Tree Construction for s = CAGAAGT$
Insert each suffix j per step.
[Figure: the partial suffix trees after inserting suffixes j = 1 through 4; each leaf label (1, j) records the sequence id and the starting position of the inserted suffix.]
Suffix Tree Construction for s = CAGAAGT$ (contd.)
[Figure: the partial suffix trees after inserting the remaining suffixes, yielding the full suffix tree for CAGAAGT$.]
Suffix Tree for Entire Database
D = {s_1 = CAGAAGT, s_2 = TGACAG, s_3 = GAAGT}
[Figure: the generalized suffix tree over all three sequences; each leaf label (i, j) records the sequence id i and the starting position j of the corresponding suffix.]
Frequent Substrings
Once the suffix tree is built, we can compute all the frequent substrings by
checking how many distinct sequences appear in the leaves at or under each node.
The node labels for the nodes with support at least minsup yield the set of
frequent substrings; all the prefixes of such node labels are also frequent.
The suffix tree can also support ad hoc queries for finding all the occurrences of
any query substring q in the database. Starting at the root, we follow the path
that matches the symbols of q until either all symbols of q have been matched or
a mismatch occurs. If q is found, the set of leaves under the end of that path
gives all occurrences of q. If there is a mismatch, the query does not occur in
the database.
Because we have to match each character in q, we immediately get O(|q|) as the
time bound (assuming that |Σ| is a constant), which is independent of the size of
the database. Listing all the matches takes additional time, for a total time
complexity of O(|q| + k), if there are k matches.
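The sketch below uses an uncompressed suffix trie (one symbol per edge) as a simplified stand-in for the generalized suffix tree, so it is quadratic in space, but it makes the occurrence query and the support computation easy to see; all names are illustrative:

def build_suffix_trie(db):
    """Uncompressed suffix trie over all sequences: each node maps symbols to
    children and records the (sequence id, suffix start) pairs passing through it."""
    root = {"children": {}, "occ": set()}
    for i, s in enumerate(db, start=1):
        s = s + "$"
        for j in range(len(s)):
            node = root
            for sym in s[j:]:
                node = node["children"].setdefault(sym, {"children": {}, "occ": set()})
                node["occ"].add((i, j + 1))            # 1-based start position
    return root

def occurrences(root, q):
    """All (sequence id, start position) occurrences of substring q."""
    node = root
    for sym in q:
        if sym not in node["children"]:
            return set()                               # mismatch: q does not occur
        node = node["children"][sym]
    return node["occ"]

def support(root, q):
    """Number of distinct sequences containing q."""
    return len({i for i, _ in occurrences(root, q)})

db = ["CAGAAGT", "TGACAG", "GAAGT"]
root = build_suffix_trie(db)
print(occurrences(root, "GAAG"))   # {(1, 3), (3, 1)} (order may vary)
print(support(root, "AG"))         # 3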
Ukkonen’s Linear Time Suffix Tree Algorithm
Achieving Linear Space: If an algorithm stores all the symbols on each edge
label, then the space complexity is O(n²), and we cannot achieve linear time
construction either.
The trick is to not explicitly store all the edge labels, but rather to use an
edge-compression technique, where we store only the starting and ending positions
of the edge label in the input string s. That is, if an edge label is given as s[i : j],
then we represent it as the interval [i, j].
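A tiny sketch of edge compression, assuming 1-based inclusive intervals as in the text:

s = "CAGAAGT$"

# An edge label s[i : j] is stored as the interval [i, j] rather than as an
# explicit string, so each edge takes O(1) space regardless of label length.
class Edge:
    def __init__(self, start, end):
        self.start, self.end = start, end      # positions into s

    def label(self, s):
        return s[self.start - 1 : self.end]    # recover the label on demand

e = Edge(3, 6)
print(e.label(s))   # 'GAAG', i.e., the substring s[3 : 6]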
Suffix Tree using Edge-compression: s = CAGAAGT$
[Figure: the suffix tree for s = CAGAAGT$ with edge-compressed labels; each edge stores an interval such as [1, 8] = CAGAAGT$ or [3, 3] = G instead of the explicit substring, and leaves carry labels (1, j) as before.]
Algorithm NaiveUkkonen
NaiveUkkonen (s):
1 n ← |s|
2 s[n + 1] ← $  // append terminal character
3 T ← ∅  // add empty string as root
4 foreach i = 1, ..., n + 1 do  // phase i: construct T_i
5     foreach j = 1, ..., i do  // extension j for phase i
          // Insert s[j : i] into the suffix tree
6         Find end of the path with label s[j : i − 1] in T
7         Insert s_i at end of path
8 return T
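The following sketch mirrors the two nested loops of NaiveUkkonen, but inserts into an uncompressed trie of nested dicts rather than an edge-compressed suffix tree (a simplification to keep it short):

def naive_phases(s):
    """Phase i builds T_i by inserting every extension s[j:i], j = 1..i, exactly
    as in the pseudocode; the tree here is an uncompressed trie of nested dicts."""
    s = s + "$"                          # append terminal character
    root = {}
    for i in range(1, len(s) + 1):       # phase i: construct T_i
        for j in range(1, i + 1):        # extension j: insert s[j:i]
            node = root
            for sym in s[j - 1:i]:       # walk/create the path for s[j:i]
                node = node.setdefault(sym, {})
    return root

tree = naive_phases("CAGAAGT")
print(sorted(tree))   # ['$', 'A', 'C', 'G', 'T'], the children of the root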
Ukkonen’s Linear Time Algorithm: Implicit Suffixes
This optimization states that, in phase i, if the jth extension s[j : i] is found in the
tree, then any subsequent extensions will also be found, and consequently there is
no need to process further extensions in phase i.
Thus, the suffix tree T_i at the end of phase i has implicit suffixes corresponding
to extensions j + 1 through i.
It is important to note that all suffixes will become explicit the first time we
encounter a new substring that does not already exist in the tree. This will surely
happen in phase n + 1 when we process the terminal character $, as it cannot
occur anywhere else in s (after all, $ ∉ Σ).
Ukkonen’s Algorithm: Implicit Extensions
Let the current phase be i, and let l ≤ i − 1 be the last explicit suffix in the
previous tree T_{i−1}.
All explicit suffixes in T_{i−1} have edge labels of the form [x, i − 1] leading to the
corresponding leaf nodes, where the starting position x is node specific, but the
ending position must be i − 1 because s_{i−1} was added to the end of these paths in
phase i − 1.
In the current phase i, we would have to extend these paths by adding s_i at the
end. However, instead of explicitly incrementing all the ending positions, we can
replace the ending position by a pointer e that keeps track of the current phase
being processed.
If we replace [x, i − 1] with [x, e], then in phase i, by setting e = i, all the l existing
suffixes are immediately and implicitly extended to [x, i]. Thus, with the single
operation of incrementing e we have, in effect, taken care of extensions 1 through l
for phase i.
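A minimal sketch of the global-end trick, assuming leaf edges hold a shared, mutable end pointer (the class names End and LeafEdge are illustrative):

# Leaf edges store a shared, mutable end pointer e instead of a fixed ending
# position, so incrementing e once extends every leaf edge label at the same time.
class End:
    def __init__(self):
        self.value = 0

class LeafEdge:
    def __init__(self, start, end):
        self.start, self.end = start, end      # end is a shared End object

    def label(self, s):
        return s[self.start - 1 : self.end.value]

s = "CAGAAGT$"
e = End()
leaves = [LeafEdge(1, e), LeafEdge(3, e)]
e.value = 6                                    # phase i = 6 ...
print([leaf.label(s) for leaf in leaves])      # ['CAGAAG', 'GAAG']
e.value = 7                                    # ... one increment extends all leaves
print([leaf.label(s) for leaf in leaves])      # ['CAGAAGT', 'GAAGT']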
Implicit Extensions: s = CAGAAGT$, Phase i = 7
[Figure: implicit extensions in phase i = 7; setting e = 7 extends leaf edge labels such as [1, e] = CAGAAG to CAGAAGT and [5, e] = AG to AGT without visiting each leaf.]
Ukkonen’s Algorithm: Skip/count Trick
For the jth extension of phase i, we have to search for the substring s[j : i − 1] so
that we can add s_i at the end.
Note that this string must exist in T_{i−1} because we have already processed symbol
s_{i−1} in the previous phase. Thus, instead of searching for each character in
s[j : i − 1] starting from the root, we first count the number of symbols on the
edge beginning with character s_j; let this length be m. If m is longer than the
length of the substring (i.e., if m > i − j), then the substring must end on this
edge, so we simply jump to position i − j and insert s_i.
On the other hand, if m ≤ i − j, then we can skip directly to the child node, say
v_c, and search for the remaining string s[j + m : i − 1] from v_c using the same
skip/count technique.
With this optimization, the cost of an extension becomes proportional to the
number of nodes on the path, as opposed to the number of characters in
s[j : i − 1].
Ukkonen’s Algorithm: Suffix Links
We can avoid searching for the substring s[j : i − 1] from the root via the use of
suffix links.
For each internal node v_a we maintain a link to the internal node v_b, where L(v_b)
is the immediate suffix of L(v_a).
In extension j − 1, let v_p denote the internal node under which we find s[j − 1 : i],
and let m be the length of the node label of v_p. To insert the jth extension s[j : i],
we follow the suffix link from v_p to another node, say v_s, and search for the
remaining substring s[j + m − 1 : i − 1] from v_s.
The use of suffix links allows us to jump internally within the tree for different
extensions, as opposed to searching from the root each time.
Linear Time Ukkonen Algorithm
Ukkonen (s):
1 n ← |s|
2 s[n + 1] ← $  // append terminal character
3 T ← ∅  // add empty string as root
4 l ← 0  // last explicit suffix
5 foreach i = 1, ..., n + 1 do  // phase i: construct T_i
6     e ← i  // implicit extensions
7     foreach j = l + 1, ..., i do  // extension j for phase i
          // Insert s[j : i] into the suffix tree
8         Find end of s[j : i − 1] in T via skip/count and suffix links
9         if s_i ∈ T then  // implicit suffixes
10            break
11        else
12            Insert s_i at end of path
13            Set last explicit suffix l if needed
14 return T
Ukkonen's Suffix Tree Construction: s = CAGAAGT$
[Figure: the implicit suffix trees built during the first few phases, with edge labels stored as intervals such as [1, e] = CAG and [2, e] = AG; leaf labels (1, j) record the suffix start positions.]
Ukkonen's Suffix Tree Construction: s = CAGAAGT$ (contd.)
[Figure: the implicit suffix trees T5, T6, and T7, whose leaf edges carry interval labels of the form [x, e] that grow implicitly as e advances.]
Extensions in Phase i = 7
[Figure: the explicit extensions of phase i = 7 (symbol T), shown in panels (a)-(c); edges are split, new internal nodes v_A, v_AG, and v_G appear, and new leaf edges such as [7, e] = T are added.]
Data Mining and Machine Learning:
Fundamental Concepts and Algorithms
dataminingbook.info