PrefixSpan The Presentation
Philippe Fournier-Viger
http://www.philippe-fournier-viger.com
Fournier-Viger, P., Lin, J. C.-W., Kiran, R. U., Koh, Y. S., Thomas, R. (2017). A
Survey of Sequential Pattern Mining. Data Science and Pattern Recognition
(DSPR), vol. 1(1), pp. 54-77.
What is a discrete sequence?
A sequence is an ordered list of symbols.
I go back home
a b f g
a b c d
e f g h
Sequential Pattern Mining
• It is a popular data mining task, introduced in 1994
by Agrawal & Srikant.
• The goal is to find all subsequences that appear
frequently in a set of discrete sequences.
• For example:
– find sequences of items purchased by many customers
over time,
– find sequences of locations frequently visited by
tourists in a city,
– find sequences of words that appear frequently in a
text.
Definition: Items
Let there be a set of items (symbols) called 𝐼.
Example: 𝐼 = {𝑎, 𝑏, 𝑐, 𝑑, 𝑒}
𝑎 = apple
𝑏 = bread
𝑐 = cake
𝑑 = dates
𝑒 = eggs
Definition: Itemset
An itemset is a set of items that is a subset of 𝐼.
Definition: Sequence
A sequence is an ordered list of itemsets.
Example: ⟨{𝑎}, {𝑎}, {𝑐}⟩
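To make the three notions concrete, here is a minimal Python sketch. The representation (strings for items, frozensets for itemsets, lists for sequences) and the helper name `is_valid_sequence` are assumptions chosen for illustration, not something prescribed by the slides.

```python
# Items are plain strings; the letters follow the example I = {a, b, c, d, e}.
ITEMS = {"a", "b", "c", "d", "e"}

# An itemset is an unordered set of items, i.e. a subset of ITEMS.
itemset = frozenset({"a", "b"})

# A sequence is an ordered list of itemsets: here <{a}, {a}, {c}>.
sequence = [frozenset({"a"}), frozenset({"a"}), frozenset({"c"})]


def is_valid_sequence(seq, items=ITEMS):
    """Return True if every element of seq is a non-empty subset of items.

    Hypothetical helper, used only to check the representation.
    """
    return all(len(s) > 0 and s <= items for s in seq)


print(itemset <= ITEMS)             # True: {a, b} is a subset of I
print(is_valid_sequence(sequence))  # True
```

Order matters between itemsets (the list), but not inside an itemset (the set).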
Definition: Subsequence (⊑)
Let there be two sequences:
𝑆𝐴 = ⟨𝐴1, 𝐴2, …, 𝐴𝑟⟩ and 𝑆𝐵 = ⟨𝐵1, 𝐵2, …, 𝐵𝑡⟩.
The sequence 𝑆𝐴 is a subsequence of 𝑆𝐵 if and only
if there exist 𝑟 integers 1 ≤ 𝑖1 < 𝑖2 < ⋯ < 𝑖𝑟 ≤ 𝑡
such that 𝐴1 ⊆ 𝐵_{𝑖1}, 𝐴2 ⊆ 𝐵_{𝑖2}, …, 𝐴𝑟 ⊆ 𝐵_{𝑖𝑟}.
This is denoted as 𝑆𝐴 ⊑ 𝑆𝐵.
Examples: ⟨{𝑎, 𝑐}⟩ ⊑ ⟨{𝑎, 𝑏, 𝑐}⟩
⟨{𝑎}, {𝑐}⟩ ⊑ ⟨{𝑎, 𝑏}, {𝑑}, {𝑏, 𝑐}⟩
Counter-examples: ⟨{𝑎, 𝑐}⟩ ⋢ ⟨{𝑎}, {𝑐}⟩
⟨{𝑎}, {𝑐}⟩ ⋢ ⟨{𝑎, 𝑐}, {𝑑}⟩
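The subsequence relation can be checked with a single left-to-right scan. The following is a minimal sketch in Python: the function name `is_subsequence` and the shorthand helper `seq` (which writes an itemset of single-character items as a string) are assumptions made for this illustration.

```python
def is_subsequence(sa, sb):
    """Return True if sa ⊑ sb, i.e. the itemsets of sa can be mapped, in order,
    to distinct itemsets of sb that contain them.

    Greedy scan: each itemset of sa is matched to the earliest possible
    itemset of sb, which never prevents a later match.
    """
    pos = 0
    for a in sa:
        # Skip itemsets of sb that do not contain a.
        while pos < len(sb) and not a <= sb[pos]:
            pos += 1
        if pos == len(sb):
            return False
        pos += 1  # the next itemset of sa must be matched strictly later
    return True


def seq(*itemsets):
    """Hypothetical shorthand: seq("ab", "c") stands for <{a, b}, {c}>."""
    return [frozenset(s) for s in itemsets]


print(is_subsequence(seq("a", "c"), seq("ab", "d", "bc")))  # True
print(is_subsequence(seq("ac"), seq("a", "c")))             # False
```

The second call is false because 𝑎 and 𝑐 must appear together in one itemset, while the first call only requires 𝑎 to appear before 𝑐.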
Definition: Sequence database
A sequence database 𝐷 is a set of discrete
sequences 𝐷 = {𝑆1 , 𝑆2 , … 𝑆𝑚 } where each
sequence 𝑆𝑗 ∈ 𝐷 has a unique identifier 𝑗.
Example:
Sequence database
𝑆1 = ⟨{𝑎, 𝑏}, {𝑐}, {𝑎}⟩
𝑆2 = ⟨{𝑎}, {𝑏}, {𝑐}⟩
𝑆3 = ⟨{𝑏}, {𝑐}, {𝑑}⟩
𝑆4 = ⟨{𝑏}, {𝑎, 𝑏}, {𝑐}⟩
Definition: Support of a sequence
The number of sequences in a sequence
database 𝐷 that contain a sequence 𝑆𝐴 is called
the support of 𝑆𝐴. It is defined as:
𝑠𝑢𝑝(𝑆𝐴) = |{𝑆 | 𝑆 ∈ 𝐷 and 𝑆𝐴 ⊑ 𝑆}|
Examples, using the sequence database above:
Example 1: 𝑠𝑢𝑝(⟨{𝑎}⟩) = 3 (𝑆1, 𝑆2 and 𝑆4)
Example 2: 𝑠𝑢𝑝(⟨{𝑏}⟩) = 4 (all four sequences)
Example 3: 𝑠𝑢𝑝(⟨{𝑎}, {𝑏}⟩) = 1 (only 𝑆2)
Example 4: 𝑠𝑢𝑝(⟨{𝑎, 𝑏}⟩) = 2 (𝑆1 and 𝑆4)
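The support definition translates almost word for word into code. Below is a minimal Python sketch; the names `support`, `is_subsequence`, `seq` and `DB` are assumptions for this illustration, and `DB` encodes the four-sequence database used in the support examples (with 𝑆2 = ⟨{𝑎}, {𝑏}, {𝑐}⟩).

```python
def is_subsequence(sa, sb):
    """Return True if sa ⊑ sb (greedy left-to-right matching)."""
    pos = 0
    for a in sa:
        while pos < len(sb) and not a <= sb[pos]:
            pos += 1
        if pos == len(sb):
            return False
        pos += 1
    return True


def seq(*itemsets):
    """Hypothetical shorthand: seq("ab", "c") stands for <{a, b}, {c}>."""
    return [frozenset(s) for s in itemsets]


def support(pattern, db):
    """Number of sequences of db that contain pattern."""
    return sum(1 for s in db if is_subsequence(pattern, s))


# S1..S4 of the support examples.
DB = [
    seq("ab", "c", "a"),
    seq("a", "b", "c"),
    seq("b", "c", "d"),
    seq("b", "ab", "c"),
]

print(support(seq("a"), DB))       # 3
print(support(seq("b"), DB))       # 4
print(support(seq("a", "b"), DB))  # 1
print(support(seq("ab"), DB))      # 2
```

Note that a sequence counts at most once, no matter how many times the pattern occurs inside it.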
Definition: Sequential pattern mining
• Input: A sequence database 𝐷 and a
minimum support threshold 𝑚𝑖𝑛𝑠𝑢𝑝 > 0.
• Output: All sequential patterns.
A sequential pattern is a sequence 𝑆 where
𝑠𝑢𝑝(𝑆) ≥ 𝑚𝑖𝑛𝑠𝑢𝑝.
Example 1
INPUT:
Sequence database
𝑆1 = ⟨{𝑎, 𝑏}, {𝑐}, {𝑎}⟩
𝑆2 = ⟨{𝑎, 𝑏}, {𝑏}, {𝑐}⟩
𝑆3 = ⟨{𝑏}, {𝑐}, {𝑑}⟩
𝑆4 = ⟨{𝑏}, {𝑎, 𝑏}, {𝑐}⟩
𝑚𝑖𝑛𝑠𝑢𝑝 = 3
OUTPUT: all sequences with a support of at least 3.
Example 2
INPUT: the same sequence database, with 𝑚𝑖𝑛𝑠𝑢𝑝 = 4
OUTPUT: all sequences with a support of at least 4.
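The effect of the threshold can be sketched as a simple filter. In the Python sketch below, the candidate list is hand-picked for illustration (a real miner has to enumerate candidates itself), and the names `sequential_patterns`, `CANDIDATES`, `seq` and `DB` are assumptions. Each candidate is written as a tuple of strings, one string per itemset.

```python
def is_subsequence(sa, sb):
    """Return True if sa ⊑ sb (greedy left-to-right matching)."""
    pos = 0
    for a in sa:
        while pos < len(sb) and not a <= sb[pos]:
            pos += 1
        if pos == len(sb):
            return False
        pos += 1
    return True


def seq(*itemsets):
    """Hypothetical shorthand: seq("ab", "c") stands for <{a, b}, {c}>."""
    return [frozenset(s) for s in itemsets]


def support(pattern, db):
    return sum(1 for s in db if is_subsequence(pattern, s))


# The sequence database of Examples 1 and 2.
DB = [seq("ab", "c", "a"), seq("ab", "b", "c"), seq("b", "c", "d"), seq("b", "ab", "c")]

# Hand-picked candidates (an assumption of this sketch, not a full enumeration).
CANDIDATES = [("a",), ("b",), ("c",), ("d",), ("a", "c"), ("ab",),
              ("ab", "c"), ("b", "a"), ("b", "b"), ("b", "c")]


def sequential_patterns(candidates, db, minsup):
    """Keep the candidates whose support is at least minsup."""
    result = {}
    for cand in candidates:
        sup = support(seq(*cand), db)
        if sup >= minsup:
            result[cand] = sup
    return result


print(sequential_patterns(CANDIDATES, DB, 3))
print(sequential_patterns(CANDIDATES, DB, 4))
```

Raising 𝑚𝑖𝑛𝑠𝑢𝑝 from 3 to 4 can only shrink the result: among these candidates, seven pass the first threshold and three pass the second.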
[Figures: charts for the Kosarak, BMS, Leviathan and Snake datasets]
The “Apriori” property
Property (anti-monotonicity).
Let X and Y be two sequences. If X ⊑ Y, then the
support of Y is less than or equal to the support of X.
Example
Sequence database
𝑆1 = ⟨{𝑎, 𝑏}, {𝑐}, {𝑎}⟩
𝑆2 = ⟨{𝑎, 𝑏}, {𝑏}, {𝑐}⟩
𝑆3 = ⟨{𝑏}, {𝑐}, {𝑑}⟩
𝑆4 = ⟨{𝑏}, {𝑎, 𝑏}, {𝑐}⟩
The support of ⟨{𝑏}⟩ is 4
The support of ⟨{𝑏}, {𝑐}⟩ is 4
The support of ⟨{𝑏}, {𝑐}, {𝑑}⟩ is 1
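The property can be checked numerically on the example. The Python sketch below computes the support along the chain ⟨{𝑏}⟩ ⊑ ⟨{𝑏}, {𝑐}⟩ ⊑ ⟨{𝑏}, {𝑐}, {𝑑}⟩; all helper names (`is_subsequence`, `support`, `seq`, `DB`) are assumptions made for this illustration.

```python
def is_subsequence(sa, sb):
    """Return True if sa ⊑ sb (greedy left-to-right matching)."""
    pos = 0
    for a in sa:
        while pos < len(sb) and not a <= sb[pos]:
            pos += 1
        if pos == len(sb):
            return False
        pos += 1
    return True


def seq(*itemsets):
    """Hypothetical shorthand: seq("ab", "c") stands for <{a, b}, {c}>."""
    return [frozenset(s) for s in itemsets]


def support(pattern, db):
    return sum(1 for s in db if is_subsequence(pattern, s))


DB = [seq("ab", "c", "a"), seq("ab", "b", "c"), seq("b", "c", "d"), seq("b", "ab", "c")]

# Each pattern is a subsequence of the next one.
chain = [seq("b"), seq("b", "c"), seq("b", "c", "d")]
supports = [support(p, DB) for p in chain]
print(supports)  # [4, 4, 1]

# Anti-monotonicity: the support never increases along the chain.
print(all(x >= y for x, y in zip(supports, supports[1:])))  # True
```

This is what lets a miner prune: once a pattern is infrequent, none of its extensions needs to be examined.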
THE PREFIXSPAN ALGORITHM
The PrefixSpan algorithm
• Proposed by Jian Pei et al. (2001)
• This algorithm is designed to only consider
patterns that exist in the database.
• This algorithm uses a concept of database
projection and a depth-first search.
• This is not the most efficient algorithm, but it
is simple and easy to extend, so it is popular.
• I will explain with an example.
Example
This is the input:
Sequence database
𝑆1 = ⟨{𝑎, 𝑏}, {𝑐}, {𝑎}⟩
𝑆2 = ⟨{𝑎, 𝑏}, {𝑏}, {𝑐}⟩
𝑆3 = ⟨{𝑏}, {𝑐}, {𝑑}⟩
𝑆4 = ⟨{𝑏}, {𝑎, 𝑏}, {𝑐}⟩
𝑚𝑖𝑛𝑠𝑢𝑝 = 3
Step 1
PrefixSpan first counts the support of each item by scanning the
database:
Result:
⟨{𝑎}⟩ support: 3
⟨{𝑏}⟩ support: 4
⟨{𝑐}⟩ support: 4
⟨{𝑑}⟩ support: 1
Step 2
PrefixSpan eliminates infrequent items. Here ⟨{𝑑}⟩ is removed because
its support (1) is below 𝑚𝑖𝑛𝑠𝑢𝑝 = 3:
Result:
⟨{𝑎}⟩ support: 3
⟨{𝑏}⟩ support: 4
⟨{𝑐}⟩ support: 4
Those are the sequential
patterns containing one item!
PrefixSpan then extends each
item recursively…
Let's start with ⟨{𝑎}⟩ →
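Steps 1 and 2 can be sketched in a few lines of Python. The representation (a list of sequences, each a list of sets) and the function names `item_supports` and `frequent_items` are assumptions made for this illustration.

```python
from collections import Counter

# The input database of the example.
DB = [
    [{"a", "b"}, {"c"}, {"a"}],
    [{"a", "b"}, {"b"}, {"c"}],
    [{"b"}, {"c"}, {"d"}],
    [{"b"}, {"a", "b"}, {"c"}],
]


def item_supports(db):
    """Step 1: one scan of the database, counting each item once per sequence."""
    counts = Counter()
    for sequence in db:
        # The union removes repetitions of an item inside the same sequence.
        counts.update(set().union(*sequence))
    return counts


def frequent_items(db, minsup):
    """Step 2: keep only the items whose support reaches minsup."""
    return {item: sup for item, sup in item_supports(db).items() if sup >= minsup}


print(sorted(item_supports(db=DB).items()))
# [('a', 3), ('b', 4), ('c', 4), ('d', 1)]
print(sorted(frequent_items(DB, minsup=3).items()))
# [('a', 3), ('b', 4), ('c', 4)]
```

Item 𝑎 appears twice in 𝑆1 but still contributes only 1 to its support, since support counts sequences, not occurrences.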
Step 3 – Find patterns starting with ⟨{𝑎}⟩
Projected database of ⟨{𝑎}⟩
𝑆1 = ⟨{_𝑏}, {𝑐}, {𝑎}⟩
𝑆2 = ⟨{_𝑏}, {𝑏}, {𝑐}⟩
𝑆4 = ⟨{_𝑏}, {𝑐}⟩
(“_” stands for the prefix item 𝑎, which is in the same itemset as 𝑏.)
Result:
⟨{𝑎}, {𝑎}⟩ support: 1
⟨{𝑎}, {𝑏}⟩ support: 1
⟨{𝑎}, {𝑐}⟩ support: 3
⟨{𝑎, 𝑏}⟩ support: 3
𝑚𝑖𝑛𝑠𝑢𝑝 = 3
Only ⟨{𝑎}, {𝑐}⟩ and ⟨{𝑎, 𝑏}⟩ are frequent.
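The projection on a single item and the counting of its extensions can be sketched as follows. This is an illustrative Python sketch, not the author's code: the names `project` and `count_extensions`, the dictionary representation, and the convention that items inside an itemset are ordered alphabetically are all assumptions.

```python
from collections import Counter

# The example database, keyed by sequence identifier.
DB = {
    1: [{"a", "b"}, {"c"}, {"a"}],
    2: [{"a", "b"}, {"b"}, {"c"}],
    3: [{"b"}, {"c"}, {"d"}],
    4: [{"b"}, {"a", "b"}, {"c"}],
}


def project(db, item):
    """Projected database of <{item}>: for each sequence containing item, keep
    what follows its first occurrence. `rest` is the "_" itemset, i.e. the
    items that share the itemset with item and come after it alphabetically."""
    projected = {}
    for sid, sequence in db.items():
        for k, itemset in enumerate(sequence):
            if item in itemset:
                rest = {x for x in itemset if x > item}
                projected[sid] = (rest, sequence[k + 1:])
                break
    return projected


def count_extensions(projected, item):
    """Count extensions of <{item}>: same-itemset ones, giving <{item, x}>,
    and next-itemset ones, giving <{item}, {x}>."""
    same_itemset, next_itemset = Counter(), Counter()
    for rest, following in projected.values():
        same, later = set(rest), set()
        for itemset in following:
            later |= itemset
            if item in itemset:  # item may reappear later with new companions
                same |= {x for x in itemset if x > item}
        same_itemset.update(same)
        next_itemset.update(later)
    return same_itemset, next_itemset


proj_a = project(DB, "a")
same, later = count_extensions(proj_a, "a")
print(sorted(proj_a))          # [1, 2, 4]
print(dict(same))              # {'b': 3}
print(sorted(later.items()))   # [('a', 1), ('b', 1), ('c', 3)]
```

The counts match the result listed for this step: ⟨{𝑎, 𝑏}⟩ and ⟨{𝑎}, {𝑐}⟩ have support 3, while ⟨{𝑎}, {𝑎}⟩ and ⟨{𝑎}, {𝑏}⟩ have support 1.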
Step 4 – Find patterns starting with ⟨{𝑎}, {𝑐}⟩
Result:
⟨{𝑎}, {𝑐}, {𝑎}⟩ support: 1
𝑚𝑖𝑛𝑠𝑢𝑝 = 3
This pattern is infrequent!
Step 5 – Find patterns starting with ⟨{𝑎, 𝑏}⟩
Result:
⟨{𝑎, 𝑏}, {𝑐}⟩ support: 3
⟨{𝑎, 𝑏}, {𝑎}⟩ support: 1
⟨{𝑎, 𝑏}, {𝑏}⟩ support: 1
𝑚𝑖𝑛𝑠𝑢𝑝 = 3
Only ⟨{𝑎, 𝑏}, {𝑐}⟩ is frequent.
Step 6 – Find patterns starting with ⟨{𝑎, 𝑏}, {𝑐}⟩
Result:
⟨{𝑎, 𝑏}, {𝑐}, {𝑎}⟩ support: 1
𝑚𝑖𝑛𝑠𝑢𝑝 = 3
Step 7 – Find patterns starting with ⟨{𝑏}⟩
Result:
⟨{𝑏}, {𝑎}⟩ support: 2
⟨{𝑏}, {𝑏}⟩ support: 2
⟨{𝑏}, {𝑐}⟩ support: 4
⟨{𝑏}, {𝑑}⟩ support: 1
𝑚𝑖𝑛𝑠𝑢𝑝 = 3
Then, PrefixSpan tries to find patterns starting
with ⟨{𝑏}, {𝑐}⟩ →
Step 8 – Find patterns starting with ⟨{𝑏}, {𝑐}⟩
Result:
⟨{𝑏}, {𝑐}, {𝑎}⟩ support: 1
⟨{𝑏}, {𝑐}, {𝑑}⟩ support: 1
𝑚𝑖𝑛𝑠𝑢𝑝 = 3
Observation
PrefixSpan performs a depth-first search:
[Figure: search tree rooted at the empty sequence ⟨⟩, explored branch by
branch down to ⟨{𝑎}, {𝑐}, {𝑎}⟩ and ⟨{𝑎, 𝑏}, {𝑐}, {𝑎}⟩. Notation: frequent and
infrequent sequential patterns are marked differently.]
Pseudocode of PrefixSpan (simple version)
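A simple version can be sketched in Python as follows. This is an illustrative sketch written for these notes, not the pseudocode of the slide or of the original paper: the function names, the pattern encoding (a tuple of sorted item tuples) and the alphabetical ordering of items inside an itemset are assumptions. Each projected database is a real copy of suffixes, where the first itemset of a suffix is the one in which the last itemset of the prefix was matched.

```python
from collections import defaultdict


def prefixspan(db, minsup):
    """Return {pattern: support} for db, a list of sequences (lists of sets)."""
    patterns = {}

    def first_from(suffix, start, needed):
        # Index of the first itemset at or after `start` that contains `needed`.
        for k in range(start, len(suffix)):
            if needed <= suffix[k]:
                return k
        return None

    def extend(prefix, projected):
        last = set(prefix[-1]) if prefix else None
        top = max(last) if last else None
        s_count, i_count = defaultdict(int), defaultdict(int)
        for suffix in projected:
            s_items, i_items = set(), set()
            for k, itemset in enumerate(suffix):
                if last is None or k > 0:          # item in a later itemset
                    s_items |= itemset
                if last is not None and last <= itemset:  # item joins last itemset
                    i_items |= {x for x in itemset if x > top}
            for x in s_items:
                s_count[x] += 1
            for x in i_items:
                i_count[x] += 1
        start = 1 if prefix else 0
        for x in sorted(s_count):                  # append a new itemset {x}
            if s_count[x] >= minsup:
                new_prefix = prefix + ((x,),)
                patterns[new_prefix] = s_count[x]
                new_proj = []
                for suffix in projected:
                    k = first_from(suffix, start, {x})
                    if k is not None:
                        new_proj.append(suffix[k:])  # copy of the suffix
                extend(new_prefix, new_proj)
        for x in sorted(i_count):                  # add x to the last itemset
            if i_count[x] >= minsup:
                new_last = prefix[-1] + (x,)
                new_prefix = prefix[:-1] + (new_last,)
                patterns[new_prefix] = i_count[x]
                new_proj = []
                for suffix in projected:
                    k = first_from(suffix, 0, set(new_last))
                    if k is not None:
                        new_proj.append(suffix[k:])
                extend(new_prefix, new_proj)

    extend((), db)
    return patterns


DB = [
    [{"a", "b"}, {"c"}, {"a"}],
    [{"a", "b"}, {"b"}, {"c"}],
    [{"b"}, {"c"}, {"d"}],
    [{"b"}, {"a", "b"}, {"c"}],
]
for pattern, sup in prefixspan(DB, 3).items():
    print(pattern, sup)
```

The recursion visits patterns depth-first, and a pattern is only extended when it is frequent, which is where the anti-monotonicity property is used.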
Optimization 1
• Observation:
– Making a copy of the database for each projection can take a lot of time!
– A projected database can also take a lot of memory.
• Solution:
– Do pseudo-projections.
– This means that we don’t make a real copy. We use pointers into the original
database instead.
Sequence database
𝑆1 = ⟨{𝑎, 𝑏}, {𝑐}, {𝑎}⟩
𝑆2 = ⟨{𝑎, 𝑏}, {𝑏}, {𝑐}⟩
𝑆3 = ⟨{𝑏}, {𝑐}, {𝑑}⟩
𝑆4 = ⟨{𝑏}, {𝑎, 𝑏}, {𝑐}⟩
Projected database of ⟨{𝑎}⟩
𝑆1 = ⟨{_𝑏}, {𝑐}, {𝑎}⟩
𝑆2 = ⟨{_𝑏}, {𝑏}, {𝑐}⟩
𝑆4 = ⟨{_𝑏}, {𝑐}⟩
Pseudo-projected database of ⟨{𝑎}⟩ (pointers into the original sequences)
𝑆1 = ⟨{𝑎, 𝑏}, {𝑐}, {𝑎}⟩
𝑆2 = ⟨{𝑎, 𝑏}, {𝑏}, {𝑐}⟩
𝑆4 = ⟨{𝑏}, {𝑎, 𝑏}, {𝑐}⟩
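A pseudo-projection can be sketched as a list of pointers. In the Python sketch below, a pointer is a pair (sequence index, itemset index); this encoding and the names `pseudo_project` and `materialize` are assumptions made for illustration, and real implementations may store pointers differently.

```python
# The example database (index 0 is S1, index 3 is S4).
DB = [
    [{"a", "b"}, {"c"}, {"a"}],
    [{"a", "b"}, {"b"}, {"c"}],
    [{"b"}, {"c"}, {"d"}],
    [{"b"}, {"a", "b"}, {"c"}],
]


def pseudo_project(db, item):
    """Pseudo-projected database of <{item}>: one pointer per sequence that
    contains item, aimed at the first itemset containing it. Nothing is copied."""
    pointers = []
    for sid, sequence in enumerate(db):
        for k, itemset in enumerate(sequence):
            if item in itemset:
                pointers.append((sid, k))
                break
    return pointers


def materialize(db, pointer, item):
    """Rebuild the projected sequence behind a pointer (for display only)."""
    sid, k = pointer
    rest = {x for x in db[sid][k] if x > item}  # the "_" itemset
    return ([rest] if rest else []) + db[sid][k + 1:]


pointers = pseudo_project(DB, "a")
print(pointers)  # [(0, 0), (1, 0), (3, 1)]
print(materialize(DB, (3, 1), "a"))
```

Three small pairs replace three copied suffixes; the projected content is only read through the pointers when extensions are counted.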
Optimization 2
• Observation:
– After reading the database to count the support of each item,
PrefixSpan can remove all infrequent items from the database.
– This will reduce the database size…
– This could be done also when creating projected databases.
Sequence database (before)
𝑆1 = ⟨{𝑎, 𝑏}, {𝑐}, {𝑎}⟩
𝑆2 = ⟨{𝑎, 𝑏}, {𝑏}, {𝑐}⟩
𝑆3 = ⟨{𝑏}, {𝑐}, {𝑑}⟩
𝑆4 = ⟨{𝑏}, {𝑎, 𝑏}, {𝑐}⟩
Sequence database (after removing the infrequent item 𝑑)
𝑆1 = ⟨{𝑎, 𝑏}, {𝑐}, {𝑎}⟩
𝑆2 = ⟨{𝑎, 𝑏}, {𝑏}, {𝑐}⟩
𝑆3 = ⟨{𝑏}, {𝑐}⟩
𝑆4 = ⟨{𝑏}, {𝑎, 𝑏}, {𝑐}⟩
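This clean-up can be sketched in Python as follows. The function name `remove_infrequent_items` and the list-of-sets representation are assumptions for this illustration, and 𝑚𝑖𝑛𝑠𝑢𝑝 = 3 is taken from the running example.

```python
from collections import Counter

DB = [
    [{"a", "b"}, {"c"}, {"a"}],
    [{"a", "b"}, {"b"}, {"c"}],
    [{"b"}, {"c"}, {"d"}],
    [{"b"}, {"a", "b"}, {"c"}],
]


def remove_infrequent_items(db, minsup):
    """Return a copy of db without the items whose support is below minsup.
    Itemsets that become empty are dropped from their sequence."""
    counts = Counter()
    for sequence in db:
        counts.update(set().union(*sequence))  # each item once per sequence
    frequent = {item for item, sup in counts.items() if sup >= minsup}
    cleaned = []
    for sequence in db:
        reduced = [itemset & frequent for itemset in sequence]
        cleaned.append([itemset for itemset in reduced if itemset])
    return cleaned


cleaned = remove_infrequent_items(DB, minsup=3)
print(cleaned[2])  # [{'b'}, {'c'}]
```

Only 𝑆3 changes here, but on real data with many rare items the reduction can be substantial, and no frequent pattern is lost because an infrequent item cannot appear in a frequent pattern.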
Is PrefixSpan a good algorithm?
What influences the performance of
PrefixSpan?
Code, datasets and more…
• A fast Java implementation of PrefixSpan is available in the
SPMF data mining software
(http://www.philippe-fournier-viger.com/spmf/)
– It can be used as standalone software or as a library.
– Several other sequential pattern mining algorithms are
also provided.
– Datasets are also provided.
• A survey of sequential pattern mining:
– Fournier-Viger, P., Lin, J. C.-W., Kiran, R. U., Koh, Y. S., Thomas, R.
(2017). A Survey of Sequential Pattern Mining. Data Science and
Pattern Recognition (DSPR), vol. 1(1), pp. 54-77.