Concepts and Techniques: Mining Sequence Patterns in Transactional Databases
A sequence database

SID   sequence
10    <a(abc)(ac)d(cf)>
20    <(ad)c(bc)(ae)>
30    <(ef)(ab)(df)cb>
40    <eg(af)cbc>

<a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>.
Given support threshold min_sup = 2, <(ab)c> is a sequential pattern.
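The two statements above can be checked mechanically. The following is a minimal sketch, assuming each sequence is a list of itemsets and that a pattern is contained in a sequence when its elements map, in order, onto supersets among the sequence's elements. The helper names `parse`, `is_subsequence`, and `support` are hypothetical and chosen only for illustration.

```python
def parse(text):
    """Parse a string such as '<a(abc)(ac)d(cf)>' into a list of itemsets."""
    elements, group = [], None
    for ch in text.strip("<>"):
        if ch == "(":
            group = set()                      # start of a multi-item element
        elif ch == ")":
            elements.append(frozenset(group))  # end of a multi-item element
            group = None
        elif group is not None:
            group.add(ch)
        else:
            elements.append(frozenset(ch))     # single-item element
    return elements


def is_subsequence(pattern, sequence):
    """True if every pattern element is a subset of a later-and-later sequence element."""
    pos = 0
    for element in pattern:
        # Greedy leftmost match: skip sequence elements that do not contain this one.
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def support(pattern, database):
    """Number of database sequences that contain the pattern."""
    return sum(is_subsequence(pattern, seq) for seq in database.values())


SDB = {
    10: parse("<a(abc)(ac)d(cf)>"),
    20: parse("<(ad)c(bc)(ae)>"),
    30: parse("<(ef)(ab)(df)cb>"),
    40: parse("<eg(af)cbc>"),
}

print(is_subsequence(parse("<a(bc)dc>"), SDB[10]))  # True
print(support(parse("<(ab)c>"), SDB))               # 2 (sequences 10 and 30)
```

With min_sup = 2, a support of 2 is enough for <(ab)c> to qualify as a sequential pattern.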
Seq. ID   Sequence
10        <(bd)cb(ac)>
20        <(bf)(ce)b(fg)>
30        <(ah)(bf)abf>
40        <(be)(ce)d>
50        <a(bd)bcb(ade)>
Cand   Sup
<a>    3
<b>    5
<c>    4
<d>    3
<e>    3
<f>    2
<g>    1
<h>    1
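The length-1 candidate supports follow directly from the five sequences in the database. A minimal sketch of that first scan, assuming an item counts at most once per sequence; the function name `item_supports` is hypothetical:

```python
from collections import Counter

DB = {
    10: "<(bd)cb(ac)>",
    20: "<(bf)(ce)b(fg)>",
    30: "<(ah)(bf)abf>",
    40: "<(be)(ce)d>",
    50: "<a(bd)bcb(ade)>",
}


def item_supports(database):
    """Count, for every item, the number of sequences in which it occurs."""
    counts = Counter()
    for text in database.values():
        # A set removes repeats, so each sequence contributes at most 1 per item.
        counts.update({ch for ch in text if ch.isalpha()})
    return counts


supports = item_supports(DB)
frequent = sorted(item for item, sup in supports.items() if sup >= 2)

print(sorted(supports.items()))
# [('a', 3), ('b', 5), ('c', 4), ('d', 3), ('e', 3), ('f', 2), ('g', 1), ('h', 1)]
print(frequent)  # ['a', 'b', 'c', 'd', 'e', 'f']
```

With min_sup = 2, <g> and <h> drop out, leaving six frequent length-1 sequences.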
51 length-2 candidates

       <a>    <b>    <c>    <d>    <e>    <f>
<a>    <aa>   <ab>   <ac>   <ad>   <ae>   <af>
<b>    <ba>   <bb>   <bc>   <bd>   <be>   <bf>
<c>    <ca>   <cb>   <cc>   <cd>   <ce>   <cf>
<d>    <da>   <db>   <dc>   <dd>   <de>   <df>
<e>    <ea>   <eb>   <ec>   <ed>   <ee>   <ef>
<f>    <fa>   <fb>   <fc>   <fd>   <fe>   <ff>

       <a>    <b>      <c>      <d>      <e>      <f>
<a>           <(ab)>   <(ac)>   <(ad)>   <(ae)>   <(af)>
<b>                    <(bc)>   <(bd)>   <(be)>   <(bf)>
<c>                             <(cd)>   <(ce)>   <(cf)>
<d>                                      <(de)>   <(df)>
<e>                                               <(ef)>
<f>
Without the Apriori property: 8*8 + 8*7/2 = 92 candidates.
Apriori prunes 44.57% of the candidates.
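The counts 51 and 92 and the pruning ratio can be reproduced by enumerating the candidates. This is a sketch under the assumption that length-2 candidates are built from every ordered pair of items (two elements) plus every unordered pair of distinct items (one element); `length2_candidates` is a hypothetical helper name.

```python
from itertools import combinations, product


def length2_candidates(items):
    """All length-2 candidates over `items`: <xy> for ordered pairs, <(xy)> for item pairs."""
    sequence_ext = [f"<{x}{y}>" for x, y in product(items, repeat=2)]
    itemset_ext = [f"<({x}{y})>" for x, y in combinations(items, 2)]
    return sequence_ext + itemset_ext


frequent_items = "abcdef"   # the 6 items that pass min_sup = 2
all_items = "abcdefgh"      # all 8 items, i.e. no Apriori pruning

with_apriori = len(length2_candidates(frequent_items))
without_apriori = len(length2_candidates(all_items))
pruned = 1 - with_apriori / without_apriori

print(with_apriori)       # 51  (6*6 + 6*5/2)
print(without_apriori)    # 92  (8*8 + 8*7/2)
print(f"{pruned:.2%}")    # 44.57%
```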
[Figure: GSP mining process, min_sup = 2. Legend: candidates that cannot pass the support threshold; candidates not in the DB at all. Labeled patterns: <abba>, <(bd)bc>, <(bd)cba>.]
Candidate Generate-and-test: Drawbacks
Breadth-first search
A length-100 sequential pattern needs about 10^30 candidate sequences!
$\sum_{i=1}^{100} \binom{100}{i} = 2^{100} - 1 \approx 10^{30}$
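The size of this bound is easy to confirm numerically. A small check, using only the standard library (`math.comb` requires Python 3.8 or later):

```python
from math import comb

# Number of non-empty sub-patterns of a length-100 pattern:
# choose i of the 100 positions, for i = 1 .. 100.
total = sum(comb(100, i) for i in range(1, 101))

print(total == 2**100 - 1)  # True
print(f"{total:.3e}")       # 1.268e+30
```

The exact value is roughly 1.27 x 10^30, which the slide rounds to 10^30.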
Prefix   Suffix (projection) of <a(abc)(ac)d(cf)>
<a>      <(abc)(ac)d(cf)>
<aa>     <(_bc)(ac)d(cf)>
<ab>     <(_c)(ac)d(cf)>
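The prefix/suffix rows above can be reproduced with a short projection routine. This is only a sketch: it assumes items within an element are ordered alphabetically, matches each prefix item at its leftmost possible position, and handles only prefixes whose items each form their own element (such as <a>, <aa>, <ab>). The names `parse`, `render`, and `project` are hypothetical.

```python
def parse(text):
    """Parse a string such as '<a(abc)(ac)d(cf)>' into a list of itemsets."""
    elements, group = [], None
    for ch in text.strip("<>"):
        if ch == "(":
            group = set()
        elif ch == ")":
            elements.append(frozenset(group))
            group = None
        elif group is not None:
            group.add(ch)
        else:
            elements.append(frozenset(ch))
    return elements


def render(partial, elements):
    """Format a suffix; `partial` holds the leftover items of the last matched element."""
    parts = []
    if partial:
        parts.append("(_" + "".join(partial) + ")")
    for element in elements:
        items = "".join(sorted(element))
        parts.append(items if len(element) == 1 else f"({items})")
    return "<" + "".join(parts) + ">"


def project(sequence, prefix_items):
    """Suffix of `sequence` w.r.t. a prefix of single-item elements, or None if absent."""
    start, partial = 0, []
    for item in prefix_items:
        # Leftmost element at or after `start` that contains the item.
        idx = next((i for i in range(start, len(sequence)) if item in sequence[i]), None)
        if idx is None:
            return None
        # Items after `item` in the same element stay in the suffix, marked with "_".
        partial = sorted(x for x in sequence[idx] if x > item)
        start = idx + 1
    return render(partial, sequence[start:])


s = parse("<a(abc)(ac)d(cf)>")
print(project(s, "a"))   # <(abc)(ac)d(cf)>
print(project(s, "aa"))  # <(_bc)(ac)d(cf)>
print(project(s, "ab"))  # <(_c)(ac)d(cf)>
```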
SID   sequence
10    <a(abc)(ac)d(cf)>
20    <(ad)c(bc)(ae)>
30    <(ef)(ab)(df)cb>
40    <eg(af)cbc>
Completeness of PrefixSpan
SDB (the sequence database above)

<a>-projected database
    <(abc)(ac)d(cf)>
    <(_d)c(bc)(ae)>
    <(_b)(df)cb>
    <(_f)cbc>
  Length-2 sequential patterns: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
    <af>-proj. db

<b>-projected database
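The list of length-2 sequential patterns with prefix <a> can be cross-checked by brute force against the original database. The sketch below is not PrefixSpan itself: it simply counts the support of every candidate <ax> and <(ax)> with min_sup = 2, which should agree with what mining the <a>-projected database yields. All helper names are hypothetical.

```python
def parse(text):
    """Parse a string such as '<a(abc)(ac)d(cf)>' into a list of itemsets."""
    elements, group = [], None
    for ch in text.strip("<>"):
        if ch == "(":
            group = set()
        elif ch == ")":
            elements.append(frozenset(group))
            group = None
        elif group is not None:
            group.add(ch)
        else:
            elements.append(frozenset(ch))
    return elements


def is_subsequence(pattern, sequence):
    """Greedy leftmost containment test for a pattern in a sequence."""
    pos = 0
    for element in pattern:
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def support(pattern, database):
    return sum(is_subsequence(pattern, seq) for seq in database.values())


SDB = {
    10: parse("<a(abc)(ac)d(cf)>"),
    20: parse("<(ad)c(bc)(ae)>"),
    30: parse("<(ef)(ab)(df)cb>"),
    40: parse("<eg(af)cbc>"),
}
MIN_SUP = 2

items = sorted(set().union(*(e for seq in SDB.values() for e in seq)))
# Candidates with prefix <a>: x as a new element, or x joined into a's element.
candidates = [f"<a{x}>" for x in items] + [f"<(a{x})>" for x in items if x > "a"]
patterns = [c for c in candidates if support(parse(c), SDB) >= MIN_SUP]

print(patterns)  # ['<aa>', '<ab>', '<ac>', '<ad>', '<af>', '<(ab)>']
```

Note that <(ab)> reaches support 2 only because sequence 10 contains (abc) after its first a; a projection-based count has to take such later elements into account as well.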
Efficiency of PrefixSpan
Speed-up by Pseudoprojection
s = <a(abc)(ac)d(cf)>
<a>
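Pseudo-projection avoids copying suffixes: a projected sequence is represented by a pointer to the original sequence plus an offset where the suffix starts. The sketch below illustrates the idea on s, assuming a flattened layout of (element index, item) pairs and 0-based offsets; both the layout and the names `flatten` and `extend` are assumptions made for this example, and only extension by a new single-item element is handled.

```python
def parse(text):
    """Parse a string such as '<a(abc)(ac)d(cf)>' into a list of itemsets."""
    elements, group = [], None
    for ch in text.strip("<>"):
        if ch == "(":
            group = set()
        elif ch == ")":
            elements.append(frozenset(group))
            group = None
        elif group is not None:
            group.add(ch)
        else:
            elements.append(frozenset(ch))
    return elements


def flatten(sequence):
    """Flatten into (element_index, item) pairs so one integer can address a suffix."""
    return [(i, item) for i, element in enumerate(sequence) for item in sorted(element)]


def extend(flat, offset, item):
    """Offset of the suffix after appending `item` as a new element; None if absent."""
    # The new item must lie in an element after the one matched last.
    last_element = flat[offset - 1][0] if offset > 0 else -1
    for pos in range(offset, len(flat)):
        element_index, candidate = flat[pos]
        if element_index > last_element and candidate == item:
            return pos + 1
    return None


s = flatten(parse("<a(abc)(ac)d(cf)>"))
after_a = extend(s, 0, "a")          # suffix of s w.r.t. <a>
after_ab = extend(s, after_a, "b")   # suffix of s w.r.t. <ab>

print(after_a, after_ab)                    # 1 3
print([item for _, item in s[after_ab:]])   # ['c', 'a', 'c', 'd', 'c', 'f']
```

The items from offset 3 onward correspond to the suffix <(_c)(ac)d(cf)>, but no new sequence was built: only the integer offset changed.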
Suggested Approach:
Sets of Sequences:
$\{\{\langle i_1, i_2 \rangle, \ldots, \langle i_m, i_n, i_k \rangle\}, \ldots\}$
Serial episodes: A → B
Periodicity Analysis