Concepts and Techniques: Mining Sequence Patterns in Transactional Databases

The document discusses sequential pattern mining. It begins by defining sequential patterns and sequence databases. Challenges in sequential pattern mining include the huge number of possible patterns and the need for efficient and scalable algorithms. The document then describes the Apriori-based GSP algorithm and its generate-and-test approach. It also introduces the vertical format-based SPADE algorithm. Finally, it discusses more advanced techniques like pattern-growth methods such as PrefixSpan that avoid candidate generation through database projections.


Data Mining: Concepts and Techniques

Mining Sequence Patterns in Transactional Databases

Sequence Databases & Sequential Patterns

Transaction databases, time-series databases vs. sequence databases

Frequent patterns vs. (frequent) sequential patterns

Applications of sequential pattern mining

Customer shopping sequences:

First buy computer, then CD-ROM, and then digital camera, within 3 months.

Medical treatments, natural disasters (e.g., earthquakes), science & eng. processes, stocks and markets, etc.

Telephone calling patterns, Weblog click streams

DNA sequences and gene structures

What Is Sequential Pattern Mining?

Given a set of sequences, find the complete set of frequent subsequences

A sequence database

SID   sequence
10    <a(abc)(ac)d(cf)>
20    <(ad)c(bc)(ae)>
30    <(ef)(ab)(df)cb>
40    <eg(af)cbc>

A sequence: <(ef)(ab)(df)cb>

An element may contain a set of items

Items within an element are unordered and we list them alphabetically.

<a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>

Given support threshold min_sup = 2, <(ab)c> is a sequential pattern
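The containment and support definitions above can be sketched in Python. This is a minimal illustration, not part of the original slides: the names `is_subsequence`, `support` and `seq`, and the choice to encode each element as a Python set, are assumptions made for the example.

```python
def is_subsequence(pattern, sequence):
    """True if each element of `pattern` is a subset of some element of
    `sequence`, with the matched elements appearing in the same order
    (greedy earliest match)."""
    pos = 0
    for elem in pattern:
        while pos < len(sequence) and not elem <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # the next pattern element must match a later element
    return True


def support(pattern, db):
    """Number of sequences in `db` that contain `pattern`."""
    return sum(is_subsequence(pattern, s) for s in db.values())


def seq(*elems):
    """Hypothetical helper: seq('a', 'abc') -> [{'a'}, {'a', 'b', 'c'}]."""
    return [set(e) for e in elems]


db = {
    10: seq("a", "abc", "ac", "d", "cf"),
    20: seq("ad", "c", "bc", "ae"),
    30: seq("ef", "ab", "df", "c", "b"),
    40: seq("e", "g", "af", "c", "b", "c"),
}

print(is_subsequence(seq("a", "bc", "d", "c"), db[10]))  # True
print(support(seq("ab", "c"), db))                       # 2
```

With min_sup = 2, the support of 2 is what makes <(ab)c> a sequential pattern; it is contained in sequences 10 and 30 only.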

Challenges on Sequential Pattern Mining

A huge number of possible sequential patterns are hidden in databases

A mining algorithm should

find the complete set of patterns, when possible, satisfying the minimum support (frequency) threshold

be highly efficient, scalable, involving only a small number of database scans

be able to incorporate various kinds of user-specific constraints

Sequential Pattern Mining Algorithms

Concept introduction and an initial Apriori-like algorithm

Agrawal & Srikant. Mining sequential patterns, ICDE'95

Apriori-based method: GSP (Generalized Sequential Patterns: Srikant & Agrawal @ EDBT'96)

Pattern-growth methods: FreeSpan & PrefixSpan (Han et al. @ KDD'00; Pei et al. @ ICDE'01)

Vertical format-based mining: SPADE (Zaki @ Machine Learning'01)

Constraint-based sequential pattern mining (SPIRIT: Garofalakis, Rastogi, Shim @ VLDB'99; Pei, Han, Wang @ CIKM'02)

Mining closed sequential patterns: CloSpan (Yan, Han & Afshar @ SDM'03)

The Apriori Property of Sequential Patterns

A basic property: Apriori (Agrawal & Srikant'94)

If a sequence S is not frequent

Then none of the super-sequences of S is frequent

E.g., <hb> is infrequent, and so are <hab> and <(ah)b>

Seq. ID   Sequence
10        <(bd)cb(ac)>
20        <(bf)(ce)b(fg)>
30        <(ah)(bf)abf>
40        <(be)(ce)d>
50        <a(bd)bcb(ade)>

Given support threshold min_sup = 2
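The <hb> example can be checked directly against the five-sequence database. The snippet below is a small sketch under assumed names (`contains`, `parse`, `support`); it only counts supports and is not a mining algorithm.

```python
def contains(sequence, pattern):
    """Greedy check that `pattern` is a subsequence of `sequence`
    (both are lists of sets of items)."""
    pos = 0
    for elem in pattern:
        while pos < len(sequence) and not elem <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def parse(*elems):
    """Hypothetical helper: parse('ah', 'b') -> [{'a', 'h'}, {'b'}]."""
    return [set(e) for e in elems]


db = [
    parse("bd", "c", "b", "ac"),              # Seq. ID 10
    parse("bf", "ce", "b", "fg"),             # Seq. ID 20
    parse("ah", "bf", "a", "b", "f"),         # Seq. ID 30
    parse("be", "ce", "d"),                   # Seq. ID 40
    parse("a", "bd", "b", "c", "b", "ade"),   # Seq. ID 50
]


def support(pattern):
    return sum(contains(s, pattern) for s in db)


min_sup = 2
hb = parse("h", "b")
supers = {"<hab>": parse("h", "a", "b"), "<(ah)b>": parse("ah", "b")}

print("<hb>", support(hb))
for name, pat in supers.items():
    # A super-sequence can never have more support than <hb> itself.
    print(name, support(pat), support(pat) <= support(hb))
```

Only sequence 30 contains h at all, so every pattern involving h stays below min_sup = 2.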

GSP: Generalized Sequential Pattern Mining

GSP (Generalized Sequential Pattern) mining algorithm proposed by Srikant and Agrawal, EDBT'96
Outline of the method
Initially, every item in DB is a candidate of length-1
for each level (i.e., sequences of length-k) do
scan database to collect support count for each candidate sequence
generate candidate length-(k+1) sequences from length-k frequent sequences using Apriori
repeat until no frequent sequence or no candidate can be found
Major strength: Candidate pruning by Apriori
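The level-wise loop in the outline can be sketched as follows. This is a simplified generate-and-test miner written for illustration, not GSP itself: candidates are produced by extending each frequent length-k pattern with one frequent item (either as a new element or inside the last element), whereas GSP joins pairs of length-k patterns, so the candidate counts will differ from those on the slides. All function names are assumptions.

```python
def contains(sequence, pattern):
    """Greedy subsequence test; `sequence` is a list of sets,
    `pattern` a tuple of item tuples."""
    pos = 0
    for elem in pattern:
        while pos < len(sequence) and not set(elem) <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def level_wise(db, min_sup):
    """Return {pattern: support} for all frequent patterns."""
    def sup(p):
        return sum(contains(s, p) for s in db)

    items = sorted({x for s in db for e in s for x in e})
    counts = {((x,),): sup(((x,),)) for x in items}
    level = {p: n for p, n in counts.items() if n >= min_sup}
    frequent_items = [p[0][0] for p in level]

    result = {}
    while level:
        result.update(level)
        # Generate length-(k+1) candidates from length-k frequent patterns.
        candidates = set()
        for p in level:
            for x in frequent_items:
                candidates.add(p + ((x,),))                   # x starts a new element
                if x > p[-1][-1]:
                    candidates.add(p[:-1] + (p[-1] + (x,),))  # x joins the last element
        # Test: one pass over the database per level.
        counts = {c: sup(c) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_sup}
    return result


db = [[set(e) for e in s] for s in [
    ["bd", "c", "b", "ac"],
    ["bf", "ce", "b", "fg"],
    ["ah", "bf", "a", "b", "f"],
    ["be", "ce", "d"],
    ["a", "bd", "b", "c", "b", "ade"],
]]

patterns = level_wise(db, min_sup=2)
print(patterns[(("b", "d"), ("c",), ("b",), ("a",))])  # support of <(bd)cba>
```

Pruning here relies on the same Apriori argument as the slides: a pattern is only extended if it is itself frequent.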

Finding Length-1 Sequential Patterns

Examine GSP using an example

Initial candidates: all singleton sequences

<a>, <b>, <c>, <d>, <e>, <f>, <g>, <h>

Scan database once, count support for candidates

min_sup = 2

Seq. ID   Sequence
10        <(bd)cb(ac)>
20        <(bf)(ce)b(fg)>
30        <(ah)(bf)abf>
40        <(be)(ce)d>
50        <a(bd)bcb(ade)>

Cand   Sup
<a>    3
<b>    5
<c>    4
<d>    3
<e>    3
<f>    2
<g>    1
<h>    1
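The first scan amounts to counting, for each item, the number of sequences that contain it. A minimal sketch, assuming each sequence is stored as a list of item strings (one string per element):

```python
from collections import Counter

db = {
    10: ["bd", "c", "b", "ac"],
    20: ["bf", "ce", "b", "fg"],
    30: ["ah", "bf", "a", "b", "f"],
    40: ["be", "ce", "d"],
    50: ["a", "bd", "b", "c", "b", "ade"],
}

min_sup = 2
counts = Counter()
for elements in db.values():
    # An item counts once per sequence, however often it occurs in it.
    counts.update(set("".join(elements)))

frequent = sorted(item for item, n in counts.items() if n >= min_sup)
print(dict(sorted(counts.items())))
print(frequent)  # ['a', 'b', 'c', 'd', 'e', 'f']
```

Items g and h each occur in a single sequence, so they are dropped before length-2 candidates are generated.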

GSP: Generating Length-2 Candidates

51 length-2 candidates

       <a>     <b>     <c>     <d>     <e>     <f>
<a>    <aa>    <ab>    <ac>    <ad>    <ae>    <af>
<b>    <ba>    <bb>    <bc>    <bd>    <be>    <bf>
<c>    <ca>    <cb>    <cc>    <cd>    <ce>    <cf>
<d>    <da>    <db>    <dc>    <dd>    <de>    <df>
<e>    <ea>    <eb>    <ec>    <ed>    <ee>    <ef>
<f>    <fa>    <fb>    <fc>    <fd>    <fe>    <ff>

       <a>     <b>     <c>     <d>     <e>     <f>
<a>            <(ab)>  <(ac)>  <(ad)>  <(ae)>  <(af)>
<b>                    <(bc)>  <(bd)>  <(be)>  <(bf)>
<c>                            <(cd)>  <(ce)>  <(cf)>
<d>                                    <(de)>  <(df)>
<e>                                            <(ef)>
<f>

Without Apriori property, 8*8+8*7/2=92 candidates

Apriori prunes 44.57% candidates
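The candidate counts follow from simple counting: n frequent items give n*n ordered pairs <xy> (including <xx>) plus n*(n-1)/2 two-item elements <(xy)>. A short check, with `length2_candidates` as an assumed helper name:

```python
def length2_candidates(n):
    """n*n sequences <xy> plus n*(n-1)/2 single-element patterns <(xy)>."""
    return n * n + n * (n - 1) // 2


with_apriori = length2_candidates(6)     # only the 6 frequent items a..f
without_apriori = length2_candidates(8)  # all 8 items a..h
pruned = 1 - with_apriori / without_apriori

print(with_apriori, without_apriori, f"{pruned:.2%}")  # 51 92 44.57%
```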

The GSP Mining Process

min_sup = 2

1st scan: 8 cand. 6 length-1 seq. pat.
<a> <b> <c> <d> <e> <f> <g> <h>

2nd scan: 51 cand. 19 length-2 seq. pat. 10 cand. not in DB at all
<aa> <ab> ... <af> <ba> <bb> ... <ff> <(ab)> ... <(ef)>

3rd scan: 46 cand. 19 length-3 seq. pat. 20 cand. not in DB at all
<abb> <aab> <aba> <baa> <bab> ...

4th scan: 8 cand. 6 length-4 seq. pat.
<abba> <(bd)bc> ...

5th scan: 1 cand. 1 length-5 seq. pat.
<(bd)cba>

Candidates are dropped for one of two reasons: Cand. cannot pass sup. threshold, or Cand. not in DB at all

Seq. ID   Sequence
10        <(bd)cb(ac)>
20        <(bf)(ce)b(fg)>
30        <(ah)(bf)abf>
40        <(be)(ce)d>
50        <a(bd)bcb(ade)>

Candidate Generate-and-test: Drawbacks

A huge set of candidate sequences is generated.

Especially 2-item candidate sequences.

Multiple scans of the database are needed.

The length of each candidate grows by one at each database scan.

Inefficient for mining long sequential patterns.

A long pattern grows from short patterns

The number of short patterns is exponential in the length of the mined patterns.

The SPADE Algorithm

SPADE (Sequential PAttern Discovery using Equivalence classes) developed by Zaki, 2001

A vertical format sequential pattern mining method

A sequence database is mapped to a large set of entries of the form Item: <SID, EID>

Sequential pattern mining is performed by growing the subsequences (patterns) one item at a time by Apriori candidate generation
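The vertical mapping and the way a pattern grows by one item can be sketched with id-lists. This is an illustrative simplification, not Zaki's implementation: the functions `vertical`, `temporal_join` and `support` are assumed names, EIDs are taken to be 1-based element positions, and SPADE's equivalence-class decomposition is not shown.

```python
from collections import defaultdict

db = {
    10: ["bd", "c", "b", "ac"],
    20: ["bf", "ce", "b", "fg"],
    30: ["ah", "bf", "a", "b", "f"],
    40: ["be", "ce", "d"],
    50: ["a", "bd", "b", "c", "b", "ade"],
}


def vertical(db):
    """Map each item to its id-list of (SID, EID) pairs."""
    idlists = defaultdict(list)
    for sid, elements in db.items():
        for eid, element in enumerate(elements, start=1):
            for item in element:
                idlists[item].append((sid, eid))
    return idlists


def temporal_join(first, second):
    """Id-list of <xy>: occurrences of y that follow some x in the same sequence."""
    return sorted({(sid2, eid2)
                   for sid1, eid1 in first
                   for sid2, eid2 in second
                   if sid1 == sid2 and eid1 < eid2})


def support(idlist):
    """Number of distinct sequences in an id-list."""
    return len({sid for sid, _ in idlist})


idlists = vertical(db)
ab = temporal_join(idlists["a"], idlists["b"])
ba = temporal_join(idlists["b"], idlists["a"])
print(ab, support(ab))
print(ba, support(ba))
```

Once the id-lists exist, the support of a longer pattern is obtained by joining id-lists rather than rescanning the original sequences.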


Bottlenecks of GSP and SPADE

A huge set of candidates could be generated

1,000 frequent length-1 sequences generate a huge number of length-2 candidates: $1000 \times 1000 + \frac{1000 \times 999}{2} = 1{,}499{,}500$

Multiple scans of database in mining

Breadth-first search

Mining long sequential patterns

Needs an exponential number of short candidates

A length-100 sequential pattern needs $\sum_{i=1}^{100} \binom{100}{i} = 2^{100} - 1 \approx 10^{30}$ candidate sequences!
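Both figures can be verified with a few lines of Python; the variable names are illustrative only.

```python
from math import comb

# Length-2 candidates from 1,000 frequent items:
# 1000 * 1000 sequences <xy> plus 1000 * 999 / 2 elements <(xy)>.
length2 = 1000 * 1000 + 1000 * 999 // 2
print(length2)  # 1499500

# Non-empty sub-patterns of a length-100 pattern: sum of C(100, i), i = 1..100.
subpatterns = sum(comb(100, i) for i in range(1, 101))
print(subpatterns == 2**100 - 1)  # True
print(f"{subpatterns:.2e}")       # 1.27e+30
```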

Prefix and Suffix (Projection)

<a>, <aa>, <a(ab)> and <a(abc)> are prefixes of sequence <a(abc)(ac)d(cf)>

Given sequence <a(abc)(ac)d(cf)>

Prefix   Suffix (Prefix-Based Projection)
<a>      <(abc)(ac)d(cf)>
<aa>     <(_bc)(ac)d(cf)>
<ab>     <(_c)(ac)d(cf)>
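The prefix-based projection in the table can be reproduced with a small function. This is a sketch under stated assumptions: `project` and `fmt` are hypothetical names, a sequence is a list of item strings, and the prefix is restricted to single-item elements (such as <a>, <aa>, <ab>); prefixes with multi-item elements like <a(ab)> are not handled.

```python
def fmt(element):
    """Render one element: 'c' for a single item, '(bc)' for several."""
    items = "".join(sorted(element))
    return items if len(items) == 1 else f"({items})"


def project(sequence, prefix):
    """Suffix of `sequence` w.r.t. `prefix`, e.g. prefix 'ab' for <ab>.
    Returns None if the prefix does not occur in the sequence."""
    pos = -1
    for item in prefix:
        # Earliest element after `pos` that contains the next prefix item.
        pos = next((j for j in range(pos + 1, len(sequence))
                    if item in sequence[j]), None)
        if pos is None:
            return None
    # Items left over in the matched element are written as (_...).
    rest = "".join(sorted(x for x in sequence[pos] if x > prefix[-1]))
    head = f"(_{rest})" if rest else ""
    return "<" + head + "".join(fmt(e) for e in sequence[pos + 1:]) + ">"


s = ["a", "abc", "ac", "d", "cf"]
for prefix in ["a", "aa", "ab"]:
    print(f"<{prefix}>", project(s, prefix))
```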

Mining Sequential Patterns by Prefix Projections

Step 1: find length-1 sequential patterns
<a>, <b>, <c>, <d>, <e>, <f>
Step 2: divide search space. The complete set of seq. pat. can be partitioned into 6 subsets:
The ones having prefix <a>;
The ones having prefix <b>;
...
The ones having prefix <f>

SID   sequence
10    <a(abc)(ac)d(cf)>
20    <(ad)c(bc)(ae)>
30    <(ef)(ab)(df)cb>
40    <eg(af)cbc>

Finding Seq. Patterns with Prefix <a>

Only need to consider projections w.r.t. <a>

<a>-projected database: <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>

Find all the length-2 seq. pat. having prefix <a>: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>

Further partition into 6 subsets

Having prefix <aa>;

...

Having prefix <af>

SID   sequence
10    <a(abc)(ac)d(cf)>
20    <(ad)c(bc)(ae)>
30    <(ef)(ab)(df)cb>
40    <eg(af)cbc>
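The divide-and-project recursion can be sketched compactly. The code below is an illustrative PrefixSpan-style miner, not the authors' implementation: instead of copying suffixes, each projected database is kept as a list of (sequence index, element index) pairs pointing at the earliest match of the current prefix, and the function and variable names are assumptions.

```python
from collections import defaultdict


def prefixspan(db, min_sup):
    """db: list of sequences, each a list of sets of items.
    Returns {pattern: support}; a pattern is a tuple of sorted item tuples."""
    patterns = {}

    def mine(pattern, proj):
        # proj: (sequence index, index of the element matched by the last
        # element of `pattern`); -1 means nothing has been matched yet.
        s_ext = defaultdict(dict)  # item -> {sequence index: earliest element}
        i_ext = defaultdict(dict)
        last = set(pattern[-1]) if pattern else None
        for sid, pos in proj:
            seq = db[sid]
            # Items that can start a new element after the prefix.
            for j in range(pos + 1, len(seq)):
                for x in seq[j]:
                    s_ext[x].setdefault(sid, j)
            # Items that can be added to the last element of the prefix.
            if last:
                top = max(last)
                for j in range(pos, len(seq)):
                    if last <= seq[j]:
                        for x in seq[j]:
                            if x > top:
                                i_ext[x].setdefault(sid, j)
        for x, occ in sorted(s_ext.items()):
            if len(occ) >= min_sup:
                new = pattern + ((x,),)
                patterns[new] = len(occ)
                mine(new, list(occ.items()))
        for x, occ in sorted(i_ext.items()):
            if len(occ) >= min_sup:
                new = pattern[:-1] + (pattern[-1] + (x,),)
                patterns[new] = len(occ)
                mine(new, list(occ.items()))

    mine((), [(sid, -1) for sid in range(len(db))])
    return patterns


db = [[set(e) for e in s] for s in [
    ["a", "abc", "ac", "d", "cf"],
    ["ad", "c", "bc", "ae"],
    ["ef", "ab", "df", "c", "b"],
    ["e", "g", "af", "c", "b", "c"],
]]

patterns = prefixspan(db, min_sup=2)
length2_a = sorted(p for p in patterns
                   if p[0][0] == "a" and sum(len(e) for e in p) == 2)
print(length2_a)
```

On this database the length-2 patterns with prefix <a> come out as <aa>, <ab>, <(ab)>, <ac>, <ad> and <af>, matching the list on the slide.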

Completeness of PrefixSpan

SDB

SID   sequence
10    <a(abc)(ac)d(cf)>
20    <(ad)c(bc)(ae)>
30    <(ef)(ab)(df)cb>
40    <eg(af)cbc>

Length-1 sequential patterns: <a>, <b>, <c>, <d>, <e>, <f>

Having prefix <a>

<a>-projected database: <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>

Length-2 sequential patterns: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>

Having prefix <aa>: <aa>-proj. db

...

Having prefix <af>: <af>-proj. db

Having prefix <b>

<b>-projected database

Having prefix <c>, ..., <f>

Efficiency of PrefixSpan

No candidate sequence needs to be generated

Projected databases keep shrinking

Major cost of PrefixSpan: constructing projected databases

Can be improved by pseudo-projections

Speed-up by Pseudo-projection

Major cost of PrefixSpan: projection

Postfixes of sequences often appear repeatedly in recursive projected databases

When (projected) database can be held in main memory, use pointers to form projections

Pointer to the sequence

Offset of the postfix

s = <a(abc)(ac)d(cf)>

s|<a>: (pointer to s, 2) <(abc)(ac)d(cf)>

s|<ab>: (pointer to s, 4) <(_c)(ac)d(cf)>
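The offsets 2 and 4 can be reproduced by flattening the sequence once and recording where each postfix starts. This is a hedged sketch: `pseudo_project` is an assumed name, offsets are 1-based item positions as on the slide, and only prefixes made of single-item elements are supported.

```python
s = ["a", "abc", "ac", "d", "cf"]

# Flatten once: (item, index of the element it belongs to).
flat = [(item, idx) for idx, element in enumerate(s) for item in sorted(element)]


def pseudo_project(flat, prefix):
    """Return the 1-based item offset at which the postfix w.r.t. `prefix`
    starts, or None if the prefix does not occur. Nothing is copied."""
    offset, last_elem = 0, -1
    for item in prefix:
        for k in range(offset, len(flat)):
            # Each prefix item must match in a later element than the previous one.
            if flat[k][0] == item and flat[k][1] > last_elem:
                offset, last_elem = k + 1, flat[k][1]
                break
        else:
            return None
    return offset + 1


# A projection is then just (pointer to the sequence, offset).
projections = {prefix: (id(flat), pseudo_project(flat, prefix))
               for prefix in ["a", "ab"]}
print(pseudo_project(flat, "a"))   # 2
print(pseudo_project(flat, "ab"))  # 4
```

Because each projection is only a pair of integers, recursive projections share the single in-memory copy of the sequence.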

Pseudo-Projection vs. Physical Projection

Pseudo-projection avoids physically copying postfixes

Efficient in running time and space when database can be held in main memory

However, it is not efficient when database cannot fit in main memory

Disk-based random accessing is very costly

Suggested Approach:

Integration of physical and pseudo-projection

Swapping to pseudo-projection when the data set fits in memory

Constraint-Based Seq.-Pattern Mining

Constraint-based sequential pattern mining
Constraints: User-specified, for focused mining of desired patterns
How to explore efficient mining with constraints? Optimization
Classification of constraints
Anti-monotone: E.g., value_sum(S) < 150, min(S) > 10
Monotone: E.g., count(S) > 5, S ⊇ {PC, digital_camera}
Succinct: E.g., length(S) ≥ 10, S ∈ {Pentium, MS/Office, MS/Money}
Convertible: E.g., value_avg(S) < 25, profit_sum(S) > 160, max(S)/avg(S) < 2, median(S) − min(S) > 5
Inconvertible: E.g., avg(S) − median(S) = 0
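As an illustration of why an anti-monotone constraint allows pruning, consider value_sum(S) < 150. The sketch below assumes non-negative item values (the item names and values are made up for the example): once a pattern violates the constraint, every extension of it violates it too, so the pattern need not be grown further.

```python
# Hypothetical item values, assumed non-negative.
value = {"PC": 120, "digital_camera": 60, "CD-ROM": 10}


def value_sum(pattern):
    """Total value of all items in a pattern (a list of item tuples)."""
    return sum(value[item] for element in pattern for item in element)


def satisfies(pattern, limit=150):
    """Anti-monotone constraint value_sum(S) < limit."""
    return value_sum(pattern) < limit


ok = [("PC",), ("CD-ROM",)]                 # 130
violating = [("PC",), ("digital_camera",)]  # 180
extended = violating + [("CD-ROM",)]        # 190

print(satisfies(ok), satisfies(violating), satisfies(extended))  # True False False
```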

From Sequential Patterns to Structured Patterns

Sets, sequences, trees, graphs, and other structures

Transaction DB: Sets of items: $\{\{i_1, i_2, \ldots, i_m\}, \ldots\}$

Seq. DB: Sequences of sets: $\{\langle\{i_1, i_2\}, \ldots, \{i_m, i_n, i_k\}\rangle, \ldots\}$

Sets of Sequences: $\{\{\langle i_1, i_2\rangle, \ldots, \langle i_m, i_n, i_k\rangle\}, \ldots\}$

Sets of trees: $\{t_1, t_2, \ldots, t_n\}$

Sets of graphs (mining for frequent subgraphs): $\{g_1, g_2, \ldots, g_n\}$

Mining structured patterns in XML documents, biochemical structures, etc.

Episodes and Episode Pattern Mining

Other methods for specifying the kinds of patterns

Serial episodes: A → B

Parallel episodes: A & B

Regular expressions: (A | B)C*(D → E)

Methods for episode pattern mining

Variations of Apriori-like algorithms, e.g., GSP

Database projection-based pattern growth

Similar to the frequent pattern growth without candidate generation

Periodicity Analysis

Periodicity is everywhere: tides, seasons, daily power consumption, etc.
Full periodicity
Every point in time contributes (precisely or approximately) to the periodicity
Partial periodicity: A more general notion
Only some segments contribute to the periodicity
Jim reads NY Times 7:00-7:30 am every week day
Cyclic association rules
Associations which form cycles
Methods
Full periodicity: FFT, other statistical analysis methods
Partial and cyclic periodicity: Variations of Apriori-like mining methods

Ref: Mining Sequential Patterns

R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. EDBT'96.
H. Mannila, H. Toivonen, and A. I. Verkamo. Discovery of frequent episodes in event sequences. DAMI'97.
M. Zaki. SPADE: An Efficient Algorithm for Mining Frequent Sequences. Machine Learning, 2001.
J. Pei, J. Han, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu. PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth. ICDE'01 (TKDE'04).
J. Pei, J. Han, and W. Wang. Constraint-Based Sequential Pattern Mining in Large Databases. CIKM'02.
X. Yan, J. Han, and R. Afshar. CloSpan: Mining Closed Sequential Patterns in Large Datasets. SDM'03.
J. Wang and J. Han. BIDE: Efficient Mining of Frequent Closed Sequences. ICDE'04.
H. Cheng, X. Yan, and J. Han. IncSpan: Incremental Mining of Sequential Patterns in Large Database. KDD'04.
J. Han, G. Dong, and Y. Yin. Efficient Mining of Partial Periodic Patterns in Time Series Database. ICDE'99.
J. Yang, W. Wang, and P. S. Yu. Mining asynchronous periodic patterns in time series data. KDD'00.
