PrefixSpan The Presentation

The document introduces sequential pattern mining, a data mining task aimed at discovering frequently occurring subsequences in discrete sequences. It defines key concepts such as discrete sequences, itemsets, subsequences, and support, while also discussing the challenges and algorithms associated with mining these patterns. The document emphasizes the importance of efficient algorithms to handle the potentially vast number of sequential patterns in a database.

An Introduction to Sequential Pattern Mining

Philippe Fournier-Viger
http://www.philippe-fournier-viger.com

Fournier-Viger, P., Lin, J. C.-W., Kiran, R. U., Koh, Y. S., Thomas, R. (2017). A
Survey of Sequential Pattern Mining. Data Science and Pattern Recognition
(DSPR), vol. 1(1), pp. 54-77.

Source code and datasets are available in the SPMF library.

Introduction
• Data mining: the goal is to discover or extract useful knowledge from data.
• Many types of data can be analyzed: graphs, relational databases, time series, sequences, etc.
• In this presentation, we focus on analyzing a common type of data called discrete sequences, to find interesting patterns in them.

What is a discrete sequence?
A sequence is an ordered list of symbols.

Example 1: a sequence can be the items that are purchased by a customer over time:

Computer Monitor Router

Example 2: a sequence can be the list of words in a sentence:

I go back home

Example 3: a sequence can be the list of locations visited by a car in a city:

a b f g

[Figure: map of a city divided into cells labelled a, b, c, d (first row) and e, f, g, h (second row)]

Sequential Pattern Mining
• It is a popular data mining task, introduced in 1994 by Agrawal & Srikant.
• The goal is to find all subsequences that appear frequently in a set of discrete sequences.
• For example:
– find sequences of items purchased by many customers over time,
– find sequences of locations frequently visited by tourists in a city,
– find sequences of words that appear frequently in a text.

Definition: Items
Let there be a set of items (symbols) called 𝐼.

Example: 𝐼 = {𝑎, 𝑏, 𝑐, 𝑑, 𝑒}

𝑎 = apple
𝑏 = bread
𝑐 = cake
𝑑 = dates
𝑒 = eggs

Definition: Itemset
An itemset is a set of items that is a subset of 𝐼.

Example: {𝑎, 𝑏, 𝑐} is an itemset containing 3 items.
{𝑑, 𝑒} is an itemset containing 2 items.

• Note: an itemset cannot contain the same item twice.
• An itemset having 𝑘 items is called a 𝑘-itemset.

Definition: Sequence
A discrete sequence 𝑆 is an ordered list of itemsets 𝑆 = ⟨𝑋1, 𝑋2, …, 𝑋_𝑛⟩ where 𝑋_𝑗 ⊆ 𝐼 for any 𝑗 ∈ {1, 2, …, 𝑛}.

Example 1: ⟨{𝑎, 𝑏}, {𝑐}⟩ is a sequence containing two itemsets. It means that a customer purchased apple and bread at the same time and then purchased cake.

Example 2: ⟨{𝑎}, {𝑎}, {𝑐}⟩

Definition: Subsequence (⊑)
Let there be two sequences 𝑆_𝐴 = ⟨𝐴1, 𝐴2, …, 𝐴_𝑟⟩ and 𝑆_𝐵 = ⟨𝐵1, 𝐵2, …, 𝐵_𝑡⟩.
The sequence 𝑆_𝐴 is a subsequence of 𝑆_𝐵 if and only if there exist 𝑟 integers 1 ≤ 𝑖1 < 𝑖2 < ⋯ < 𝑖_𝑟 ≤ 𝑡 such that 𝐴1 ⊆ 𝐵_{𝑖1}, 𝐴2 ⊆ 𝐵_{𝑖2}, …, 𝐴_𝑟 ⊆ 𝐵_{𝑖_𝑟}.

This is denoted as 𝑆_𝐴 ⊑ 𝑆_𝐵.

Examples:
⟨{𝑎, 𝑐}⟩ ⊑ ⟨{𝑎, 𝑏, 𝑐}⟩
⟨{𝑎, 𝑐}⟩ ⋢ ⟨{𝑎}, {𝑐}⟩ (no single itemset contains both 𝑎 and 𝑐)
⟨{𝑎}, {𝑐}⟩ ⊑ ⟨{𝑎, 𝑏}, {𝑑}, {𝑏, 𝑐}⟩
⟨{𝑎}, {𝑐}⟩ ⋢ ⟨{𝑎, 𝑐}, {𝑑}⟩ (𝑐 never appears in an itemset after 𝑎)

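The subsequence definition can be checked mechanically. The following is a minimal Python sketch, assuming a sequence is stored as a list of sets; the function name `is_subsequence` is illustrative and is not taken from the slides or from SPMF. It scans the second sequence from left to right and matches each itemset of the first sequence to the earliest itemset that contains it, which is enough to decide whether suitable indices exist.

```python
def is_subsequence(sa, sb):
    """Return True if sequence sa is a subsequence of sequence sb."""
    i = 0  # index of the next itemset of sa that still has to be matched
    for itemset in sb:
        # `<=` on sets is the subset test used in the definition
        if i < len(sa) and sa[i] <= itemset:
            i += 1
    return i == len(sa)


# Four sequence pairs evaluated against the definition
print(is_subsequence([{"a", "c"}], [{"a", "b", "c"}]))                  # True
print(is_subsequence([{"a", "c"}], [{"a"}, {"c"}]))                     # False
print(is_subsequence([{"a"}, {"c"}], [{"a", "b"}, {"d"}, {"b", "c"}]))  # True
print(is_subsequence([{"a"}, {"c"}], [{"a", "c"}, {"d"}]))              # False
```

Matching greedily at the earliest possible position never rules out a later match, so no backtracking is needed.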
Definition: Sequence database
A sequence database 𝐷 is a set of discrete sequences 𝐷 = {𝑆1, 𝑆2, …, 𝑆_𝑚} where each sequence 𝑆_𝑗 ∈ 𝐷 has a unique identifier 𝑗.

Example 1: This is a sequence database with four sequences 𝐷 = {𝑆1, 𝑆2, 𝑆3, 𝑆4}:

Sequence database
𝑆1 = ⟨{𝑎, 𝑏}, {𝑐}, {𝑎}⟩
𝑆2 = ⟨{𝑎}, {𝑏}, {𝑐}⟩
𝑆3 = ⟨{𝑏}, {𝑐}, {𝑑}⟩
𝑆4 = ⟨{𝑏}, {𝑎, 𝑏}, {𝑐}⟩

Definition: Support of a sequence
The number of sequences in a sequence database 𝐷 that contain a sequence 𝑆_𝐴 is called the support of 𝑆_𝐴. It is defined as:
𝑠𝑢𝑝(𝑆_𝐴) = |{𝑆 | 𝑆 ∈ 𝐷 and 𝑆_𝐴 ⊑ 𝑆}|

Sequence database
𝑆1 = ⟨{𝑎, 𝑏}, {𝑐}, {𝑎}⟩
𝑆2 = ⟨{𝑎}, {𝑏}, {𝑐}⟩
𝑆3 = ⟨{𝑏}, {𝑐}, {𝑑}⟩
𝑆4 = ⟨{𝑏}, {𝑎, 𝑏}, {𝑐}⟩

Example 1: 𝑠𝑢𝑝(⟨{𝑎}⟩) = 3
Example 2: 𝑠𝑢𝑝(⟨{𝑏}⟩) = 4
Example 3: 𝑠𝑢𝑝(⟨{𝑎}, {𝑏}⟩) = 1
Example 4: 𝑠𝑢𝑝(⟨{𝑎, 𝑏}⟩) = 2

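The support definition translates directly into a count over the database. The sketch below is an illustration in Python, not code from the slides or from SPMF. It assumes sequences are lists of sets and reads the database of this slide with 𝑆2 = ⟨{𝑎}, {𝑏}, {𝑐}⟩. The helper names `is_subsequence` and `support` are hypothetical.

```python
def is_subsequence(sa, sb):
    """Return True if sa is a subsequence of sb (greedy left-to-right match)."""
    i = 0
    for itemset in sb:
        if i < len(sa) and sa[i] <= itemset:
            i += 1
    return i == len(sa)


def support(pattern, database):
    """Number of sequences of the database that contain the pattern."""
    return sum(1 for sequence in database if is_subsequence(pattern, sequence))


database = [
    [{"a", "b"}, {"c"}, {"a"}],   # S1
    [{"a"}, {"b"}, {"c"}],        # S2
    [{"b"}, {"c"}, {"d"}],        # S3
    [{"b"}, {"a", "b"}, {"c"}],   # S4
]

print(support([{"a"}], database))         # 3
print(support([{"b"}], database))         # 4
print(support([{"a"}, {"b"}], database))  # 1
print(support([{"a", "b"}], database))    # 2
```

Each sequence counts at most once, no matter how many times the pattern occurs inside it.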
Definition: Sequential pattern mining
• Input: A sequence database 𝐷 and a minimum support threshold 𝑚𝑖𝑛𝑠𝑢𝑝 > 0.
• Output: All sequential patterns. A sequential pattern is a sequence 𝑆 where 𝑠𝑢𝑝(𝑆) ≥ 𝑚𝑖𝑛𝑠𝑢𝑝.

Example 1
INPUT:
Sequence database
𝑆1 = ⟨{𝑎, 𝑏}, {𝑐}, {𝑎}⟩
𝑆2 = ⟨{𝑎, 𝑏}, {𝑏}, {𝑐}⟩
𝑆3 = ⟨{𝑏}, {𝑐}, {𝑑}⟩
𝑆4 = ⟨{𝑏}, {𝑎, 𝑏}, {𝑐}⟩

𝑚𝑖𝑛𝑠𝑢𝑝 = 3

OUTPUT: all sequential patterns:
⟨{𝑎}⟩ support = 3
⟨{𝑏}⟩ support = 4
⟨{𝑐}⟩ support = 4
⟨{𝑎}, {𝑐}⟩ support = 3
⟨{𝑎, 𝑏}⟩ support = 3
⟨{𝑏}, {𝑐}⟩ support = 4
⟨{𝑎, 𝑏}, {𝑐}⟩ support = 3

What will happen if we change the threshold? →

Example 2
INPUT: the same sequence database, with 𝑚𝑖𝑛𝑠𝑢𝑝 = 4

OUTPUT: all sequential patterns:
⟨{𝑏}⟩ support = 4
⟨{𝑐}⟩ support = 4
⟨{𝑏}, {𝑐}⟩ support = 4

Observation: If we increase the minsup threshold, fewer patterns may be found.

It is a difficult problem!
• A naïve algorithm would read the database and count the support (frequency) of all possible patterns.
• This is inefficient because there can be a very large number of possible patterns.
• For example:
⟨{𝑎}⟩, ⟨{𝑏}⟩, ⟨{𝑐}⟩, …
⟨{𝑎, 𝑏}⟩, ⟨{𝑎, 𝑐}⟩, ⟨{𝑎, 𝑑}⟩, …
⟨{𝑎}, {𝑎}⟩, ⟨{𝑎}, {𝑎}, {𝑎}⟩, ⟨{𝑎}, {𝑎}, {𝑎}, {𝑎}⟩, …, ⟨{𝑎, 𝑏}, {𝑎}⟩, …
⟨{𝑎}, {𝑏}, {𝑎}⟩, …
• An efficient algorithm must find the frequent sequential patterns without checking all possibilities.

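A rough count shows how quickly the search space grows. With 𝑚 items there are 2^𝑚 − 1 non-empty itemsets, hence (2^𝑚 − 1)^𝑘 sequences made of exactly 𝑘 itemsets. The short Python sketch below sums this over 𝑘 = 1, …, 𝑛. The function name `count_candidates` is illustrative only.

```python
def count_candidates(num_items, max_itemsets):
    """Number of distinct sequences with 1..max_itemsets non-empty itemsets."""
    itemsets = 2 ** num_items - 1  # non-empty subsets of the item set
    return sum(itemsets ** k for k in range(1, max_itemsets + 1))


# Five items (a..e) and patterns of at most three itemsets
print(count_candidates(5, 3))  # 30783
```

Even for five items and three itemsets there are tens of thousands of candidates, which is why a naïve enumeration does not scale.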
Some popular algorithms
• GSP: R. Agrawal and R. Srikant, Mining sequential patterns, ICDE 1995, pp. 3–14, 1995.
• SPAM: J. Ayres, J. Flannick, J. Gehrke, and T. Yiu, Sequential pattern mining using a bitmap representation, KDD 2002, pp. 429–435, 2002.
• SPADE: M. J. Zaki, SPADE: An efficient algorithm for mining frequent sequences, Machine Learning, vol. 42(1-2), pp. 31–60, 2001.
• PrefixSpan: J. Pei, et al. Mining sequential patterns by pattern-growth: The PrefixSpan approach, IEEE Transactions on Knowledge and Data Engineering, vol. 16(11), pp. 1424–1440, 2004.
• CM-SPAM and CM-SPADE: P. Fournier-Viger, A. Gomariz, M. Campos, and R. Thomas, Fast Vertical Mining of Sequential Patterns Using Co-occurrence Information, PAKDD 2014, pp. 40–52, 2014.

They all have the same input and output.
The difference is performance, due to optimizations, search strategies and data structures!

Fast implementations are available in the SPMF library.

A performance comparison
Four benchmark datasets are used: Kosarak, BMS, Leviathan and Snake.

[Figure: performance plots for the Kosarak, BMS, Leviathan and Snake datasets]

The “Apriori” property
Property (anti-monotonicity).
Let X and Y be two sequences. If X ⊑ Y, then the support of Y is less than or equal to the support of X.

Example
Sequence database
𝑆1 = ⟨{𝑎, 𝑏}, {𝑐}, {𝑎}⟩
𝑆2 = ⟨{𝑎, 𝑏}, {𝑏}, {𝑐}⟩
𝑆3 = ⟨{𝑏}, {𝑐}, {𝑑}⟩
𝑆4 = ⟨{𝑏}, {𝑎, 𝑏}, {𝑐}⟩

The support of ⟨{𝑏}⟩ is 4
The support of ⟨{𝑏}, {𝑐}⟩ is 4
The support of ⟨{𝑏}, {𝑐}, {𝑑}⟩ is 1

THE PREFIXSPAN ALGORITHM

PrefixSpan: J. Pei, et al. Mining sequential patterns by pattern-growth: The PrefixSpan approach, IEEE Transactions on Knowledge and Data Engineering, vol. 16(11), pp. 1424–1440, 2004.

The PrefixSpan algorithm
• Proposed by Jian Pei et al. (2001).
• This algorithm is designed to only consider patterns that exist in the database.
• It uses a concept of database projection and a depth-first search.
• It is not the most efficient algorithm, but it is simple and easy to extend, so it is popular.
• I will explain with an example.

Example
This is the input:
Sequence database
𝑆1 = ⟨{𝑎, 𝑏}, {𝑐}, {𝑎}⟩
𝑆2 = ⟨{𝑎, 𝑏}, {𝑏}, {𝑐}⟩
𝑆3 = ⟨{𝑏}, {𝑐}, {𝑑}⟩
𝑆4 = ⟨{𝑏}, {𝑎, 𝑏}, {𝑐}⟩

𝑚𝑖𝑛𝑠𝑢𝑝 = 3

Step 1
PrefixSpan first counts the support of each item by scanning the database.
Result:
⟨{𝑎}⟩ support: 3
⟨{𝑏}⟩ support: 4
⟨{𝑐}⟩ support: 4
⟨{𝑑}⟩ support: 1

Step 2
PrefixSpan eliminates infrequent items. Here ⟨{𝑑}⟩ is removed because its support (1) is below 𝑚𝑖𝑛𝑠𝑢𝑝 = 3.
Result:
⟨{𝑎}⟩ support: 3
⟨{𝑏}⟩ support: 4
⟨{𝑐}⟩ support: 4

Those are the sequential patterns containing one item!
PrefixSpan then extends each item recursively…
Let's start with ⟨{𝑎}⟩ →

Step 3 – Find patterns starting with ⟨{𝑎}⟩

PrefixSpan does a database projection with ⟨{𝑎}⟩.

What is a database projection?
It means to keep only the sequences containing ⟨{𝑎}⟩. Moreover, for these sequences, we delete the first occurrence of ⟨{𝑎}⟩ and everything that appears before it.

Sequence database
𝑆1 = ⟨{𝑎, 𝑏}, {𝑐}, {𝑎}⟩
𝑆2 = ⟨{𝑎, 𝑏}, {𝑏}, {𝑐}⟩
𝑆3 = ⟨{𝑏}, {𝑐}, {𝑑}⟩
𝑆4 = ⟨{𝑏}, {𝑎, 𝑏}, {𝑐}⟩

Projected database of ⟨{𝑎}⟩
𝑆1 = ⟨{_𝑏}, {𝑐}, {𝑎}⟩
𝑆2 = ⟨{_𝑏}, {𝑏}, {𝑐}⟩
𝑆4 = ⟨{_𝑏}, {𝑐}⟩

The symbol _ marks the place of the deleted item: {_𝑏} means that 𝑏 appears in the same itemset as the 𝑎 that was removed. 𝑆3 is dropped because it does not contain 𝑎.

𝑚𝑖𝑛𝑠𝑢𝑝 = 3

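The projection by a single item can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the SPMF implementation: each itemset is a list whose items are kept in alphabetical order, and the function name `project_on_item` is hypothetical. For each sequence, the sketch returns the items left over in the itemset where the item first occurs (the part written with _ on the slide) and the itemsets that follow.

```python
def project_on_item(database, item):
    """Project a database (dict: id -> list of sorted itemsets) on one item."""
    projected = {}
    for sid, sequence in database.items():
        for pos, itemset in enumerate(sequence):
            if item in itemset:
                # Items of the same itemset that come after `item`
                leftover = [x for x in itemset if x > item]
                projected[sid] = (leftover, sequence[pos + 1:])
                break  # only the first occurrence matters
    return projected


database = {
    1: [["a", "b"], ["c"], ["a"]],
    2: [["a", "b"], ["b"], ["c"]],
    3: [["b"], ["c"], ["d"]],
    4: [["b"], ["a", "b"], ["c"]],
}

for sid, (leftover, rest) in project_on_item(database, "a").items():
    print(sid, leftover, rest)
# 1 ['b'] [['c'], ['a']]
# 2 ['b'] [['b'], ['c']]
# 4 ['b'] [['c']]
```

Sequence 3 is absent from the result because it does not contain the item.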
Then, PrefixSpan counts the support of each sequential pattern starting with ⟨{𝑎}⟩ that has one more item:

Projected database of ⟨{𝑎}⟩
𝑆1 = ⟨{_𝑏}, {𝑐}, {𝑎}⟩
𝑆2 = ⟨{_𝑏}, {𝑏}, {𝑐}⟩
𝑆4 = ⟨{_𝑏}, {𝑐}⟩

Result:
⟨{𝑎}, {𝑎}⟩ support: 1
⟨{𝑎}, {𝑏}⟩ support: 1
⟨{𝑎}, {𝑐}⟩ support: 3
⟨{𝑎, 𝑏}⟩ support: 3

Then, infrequent patterns are removed:
Result:
⟨{𝑎}, {𝑐}⟩ support: 3
⟨{𝑎, 𝑏}⟩ support: 3

PrefixSpan then extends each pattern recursively…
Let's start with ⟨{𝑎}, {𝑐}⟩ →

Step 4 – Find patterns starting with ⟨{𝑎}, {𝑐}⟩

PrefixSpan does a database projection with ⟨{𝑎}, {𝑐}⟩:

Projected database of ⟨{𝑎}⟩
𝑆1 = ⟨{_𝑏}, {𝑐}, {𝑎}⟩
𝑆2 = ⟨{_𝑏}, {𝑏}, {𝑐}⟩
𝑆4 = ⟨{_𝑏}, {𝑐}⟩

Projected database of ⟨{𝑎}, {𝑐}⟩
𝑆1 = ⟨{𝑎}⟩

In 𝑆2 and 𝑆4 nothing follows 𝑐, so only 𝑆1 remains.

𝑚𝑖𝑛𝑠𝑢𝑝 = 3

Then, PrefixSpan counts the support of each sequential pattern starting with ⟨{𝑎}, {𝑐}⟩ that has one more item:
Result:
⟨{𝑎}, {𝑐}, {𝑎}⟩ support: 1

This pattern is infrequent!

Then PrefixSpan tries to find patterns starting with ⟨{𝑎, 𝑏}⟩ →

Step 5 – Find patterns starting with ⟨{𝑎, 𝑏}⟩

PrefixSpan does a database projection with ⟨{𝑎, 𝑏}⟩:

Projected database of ⟨{𝑎}⟩
𝑆1 = ⟨{_𝑏}, {𝑐}, {𝑎}⟩
𝑆2 = ⟨{_𝑏}, {𝑏}, {𝑐}⟩
𝑆4 = ⟨{_𝑏}, {𝑐}⟩

Projected database of ⟨{𝑎, 𝑏}⟩
𝑆1 = ⟨{𝑐}, {𝑎}⟩
𝑆2 = ⟨{𝑏}, {𝑐}⟩
𝑆4 = ⟨{𝑐}⟩

𝑚𝑖𝑛𝑠𝑢𝑝 = 3

Then, PrefixSpan counts the support of each sequential pattern starting with ⟨{𝑎, 𝑏}⟩ that has one more item:
Result:
⟨{𝑎, 𝑏}, {𝑐}⟩ support: 3
⟨{𝑎, 𝑏}, {𝑎}⟩ support: 1
⟨{𝑎, 𝑏}, {𝑏}⟩ support: 1

Then, PrefixSpan removes infrequent patterns:
Result:
⟨{𝑎, 𝑏}, {𝑐}⟩ support: 3

Then PrefixSpan tries to find patterns starting with ⟨{𝑎, 𝑏}, {𝑐}⟩ →

Step 6 – Find patterns starting with ⟨{𝑎, 𝑏}, {𝑐}⟩

PrefixSpan does a database projection for ⟨{𝑎, 𝑏}, {𝑐}⟩:

Projected database of ⟨{𝑎, 𝑏}⟩
𝑆1 = ⟨{𝑐}, {𝑎}⟩
𝑆2 = ⟨{𝑏}, {𝑐}⟩
𝑆4 = ⟨{𝑐}⟩

Projected database of ⟨{𝑎, 𝑏}, {𝑐}⟩
𝑆1 = ⟨{𝑎}⟩

𝑚𝑖𝑛𝑠𝑢𝑝 = 3

Then, PrefixSpan counts the support of each sequential pattern starting with ⟨{𝑎, 𝑏}, {𝑐}⟩ that has one more item:
Result:
⟨{𝑎, 𝑏}, {𝑐}, {𝑎}⟩ support: 1

This pattern is infrequent!

Then, PrefixSpan tries to find patterns starting with ⟨{𝑏}⟩ →

Step 7 – Find patterns starting with ⟨{𝑏}⟩

PrefixSpan does a database projection for ⟨{𝑏}⟩:

Sequence database
𝑆1 = ⟨{𝑎, 𝑏}, {𝑐}, {𝑎}⟩
𝑆2 = ⟨{𝑎, 𝑏}, {𝑏}, {𝑐}⟩
𝑆3 = ⟨{𝑏}, {𝑐}, {𝑑}⟩
𝑆4 = ⟨{𝑏}, {𝑎, 𝑏}, {𝑐}⟩

Projected database of ⟨{𝑏}⟩
𝑆1 = ⟨{𝑐}, {𝑎}⟩
𝑆2 = ⟨{𝑏}, {𝑐}⟩
𝑆3 = ⟨{𝑐}, {𝑑}⟩
𝑆4 = ⟨{𝑎, 𝑏}, {𝑐}⟩

𝑚𝑖𝑛𝑠𝑢𝑝 = 3

Then, PrefixSpan counts the support of each sequential pattern starting with ⟨{𝑏}⟩ that has one more item:
Result:
⟨{𝑏}, {𝑎}⟩ support: 2
⟨{𝑏}, {𝑏}⟩ support: 2
⟨{𝑏}, {𝑐}⟩ support: 4
⟨{𝑏}, {𝑑}⟩ support: 1

Then, PrefixSpan eliminates infrequent patterns:
Result:
⟨{𝑏}, {𝑐}⟩ support: 4

Then, PrefixSpan tries to find patterns starting with ⟨{𝑏}, {𝑐}⟩ →

Step 8 – Find patterns starting with ⟨{𝑏}, {𝑐}⟩

PrefixSpan does a database projection for ⟨{𝑏}, {𝑐}⟩:

Projected database of ⟨{𝑏}⟩
𝑆1 = ⟨{𝑐}, {𝑎}⟩
𝑆2 = ⟨{𝑏}, {𝑐}⟩
𝑆3 = ⟨{𝑐}, {𝑑}⟩
𝑆4 = ⟨{𝑎, 𝑏}, {𝑐}⟩

Projected database of ⟨{𝑏}, {𝑐}⟩
𝑆1 = ⟨{𝑎}⟩
𝑆3 = ⟨{𝑑}⟩

𝑚𝑖𝑛𝑠𝑢𝑝 = 3

Then, PrefixSpan counts the support of each sequential pattern starting with ⟨{𝑏}, {𝑐}⟩ that has one more item:
Result:
⟨{𝑏}, {𝑐}, {𝑎}⟩ support: 1
⟨{𝑏}, {𝑐}, {𝑑}⟩ support: 1

All these patterns are infrequent!
Extending ⟨{𝑐}⟩ in the same way also yields only infrequent patterns (⟨{𝑐}, {𝑎}⟩ and ⟨{𝑐}, {𝑑}⟩, each with support 1), so PrefixSpan has finished its work.

Final result:
Those are the frequent sequential patterns:
• ⟨{𝑎}⟩ support: 3
• ⟨{𝑏}⟩ support: 4
• ⟨{𝑐}⟩ support: 4
• ⟨{𝑎}, {𝑐}⟩ support: 3
• ⟨{𝑎, 𝑏}⟩ support: 3
• ⟨{𝑎, 𝑏}, {𝑐}⟩ support: 3
• ⟨{𝑏}, {𝑐}⟩ support: 4

Observation
PrefixSpan performs a depth-first search. Each pattern below is listed under the pattern it extends, in the order in which PrefixSpan visits it:

⟨⟩
    ⟨{𝑎}⟩ (frequent)
        ⟨{𝑎}, {𝑎}⟩ (infrequent)
        ⟨{𝑎}, {𝑏}⟩ (infrequent)
        ⟨{𝑎}, {𝑐}⟩ (frequent)
            ⟨{𝑎}, {𝑐}, {𝑎}⟩ (infrequent)
        ⟨{𝑎, 𝑏}⟩ (frequent)
            ⟨{𝑎, 𝑏}, {𝑐}⟩ (frequent)
                ⟨{𝑎, 𝑏}, {𝑐}, {𝑎}⟩ (infrequent)
            ⟨{𝑎, 𝑏}, {𝑎}⟩ (infrequent)
            ⟨{𝑎, 𝑏}, {𝑏}⟩ (infrequent)
    ⟨{𝑏}⟩ (frequent)
        ⟨{𝑏}, {𝑎}⟩ (infrequent)
        ⟨{𝑏}, {𝑏}⟩ (infrequent)
        ⟨{𝑏}, {𝑐}⟩ (frequent)
            ⟨{𝑏}, {𝑐}, {𝑎}⟩ (infrequent)
            ⟨{𝑏}, {𝑐}, {𝑑}⟩ (infrequent)
        ⟨{𝑏}, {𝑑}⟩ (infrequent)
    ⟨{𝑐}⟩ (frequent)
    ⟨{𝑑}⟩ (infrequent)

Pseudocode of PrefixSpan (simple version)

PrefixSpan(a database 𝐷, a sequence 𝑆 (initially empty ⟨⟩), 𝑚𝑖𝑛𝑠𝑢𝑝)

1. Scan 𝐷 to find the support of each sequence starting with 𝑆 that has one more item.
2. For each such sequence 𝑅 such that 𝑠𝑢𝑝(𝑅) ≥ 𝑚𝑖𝑛𝑠𝑢𝑝:
3.     Output 𝑅
4.     Create the projected database 𝐷_𝑅 of 𝑅 by doing a projection with 𝐷
5.     Call PrefixSpan(𝐷_𝑅, 𝑅, 𝑚𝑖𝑛𝑠𝑢𝑝)

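The pseudocode can be turned into a small runnable program. The following Python sketch is one possible reading of it, not the SPMF implementation, and all names are hypothetical. Two assumptions are worth stating: sequences are lists of sets of comparable items, and instead of copying a projected database the sketch keeps, for each sequence, a pointer to the itemset where the current pattern is matched at its earliest position. A pattern is extended either by appending a new itemset (`s_ext`) or by adding a larger item to its last itemset (`i_ext`).

```python
from collections import defaultdict


def prefixspan(database, minsup):
    """Return {pattern: support}; a pattern is a tuple of sorted item tuples."""
    patterns = {}

    def grow(prefix, pointers):
        # pointers: (sid, pos) pairs; pos is the index of the itemset matching
        # the last itemset of `prefix` at its earliest occurrence
        # (pos = -1 for the empty prefix).
        last = set(prefix[-1]) if prefix else set()
        s_ext = defaultdict(list)  # item appended as a new itemset
        i_ext = defaultdict(list)  # item added to the last itemset
        for sid, pos in pointers:
            sequence = database[sid]
            seen_s, seen_i = set(), set()
            for j in range(max(pos, 0), len(sequence)):
                itemset = sequence[j]
                if j > pos:  # strictly after the match: new itemset
                    for x in itemset - seen_s:
                        seen_s.add(x)
                        s_ext[x].append((sid, j))
                if prefix and last <= itemset:  # same itemset can be enlarged
                    for x in itemset - seen_i:
                        if x > prefix[-1][-1]:
                            seen_i.add(x)
                            i_ext[x].append((sid, j))
        for x in sorted(s_ext):
            if len(s_ext[x]) >= minsup:
                pattern = prefix + ((x,),)
                patterns[pattern] = len(s_ext[x])
                grow(pattern, s_ext[x])
        for x in sorted(i_ext):
            if len(i_ext[x]) >= minsup:
                pattern = prefix[:-1] + (prefix[-1] + (x,),)
                patterns[pattern] = len(i_ext[x])
                grow(pattern, i_ext[x])

    grow((), [(sid, -1) for sid in range(len(database))])
    return patterns


database = [
    [{"a", "b"}, {"c"}, {"a"}],   # S1
    [{"a", "b"}, {"b"}, {"c"}],   # S2
    [{"b"}, {"c"}, {"d"}],        # S3
    [{"b"}, {"a", "b"}, {"c"}],   # S4
]

for pattern, sup in prefixspan(database, 3).items():
    print(pattern, sup)
```

On this database with a threshold of 3 the sketch reports seven patterns, among them ⟨{𝑎, 𝑏}, {𝑐}⟩ with support 3. Each recursive call corresponds to one "Call PrefixSpan" step of the pseudocode, and the pointer list plays the role of the projected database.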
Optimization 1
• Observation:
– Making a copy of the database for each projection can take a lot of time!
– A projected database can also take a lot of memory.
• Solution:
– Do pseudo-projections.
– This means that we don’t make a real copy. We use pointers to the original database instead.

Sequence database
𝑆1 = ⟨{𝑎, 𝑏}, {𝑐}, {𝑎}⟩
𝑆2 = ⟨{𝑎, 𝑏}, {𝑏}, {𝑐}⟩
𝑆3 = ⟨{𝑏}, {𝑐}, {𝑑}⟩
𝑆4 = ⟨{𝑏}, {𝑎, 𝑏}, {𝑐}⟩

Projected database of ⟨{𝑎}⟩
𝑆1 = ⟨{_𝑏}, {𝑐}, {𝑎}⟩
𝑆2 = ⟨{_𝑏}, {𝑏}, {𝑐}⟩
𝑆4 = ⟨{_𝑏}, {𝑐}⟩

Pseudo-projected database of ⟨{𝑎}⟩ (the original sequences are kept, and a pointer marks where each projected sequence starts)
𝑆1 = ⟨{𝑎, 𝑏}, {𝑐}, {𝑎}⟩
𝑆2 = ⟨{𝑎, 𝑏}, {𝑏}, {𝑐}⟩
𝑆4 = ⟨{𝑏}, {𝑎, 𝑏}, {𝑐}⟩

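A pseudo-projection can be illustrated with plain index triples. The Python sketch below is a minimal illustration, not the data structure used in SPMF. It assumes itemsets are alphabetically ordered lists, and the names `pseudo_project` and `view` are hypothetical. A pointer (sequence index, itemset index, item index) records where the item first occurs; the suffix is only materialised on demand, here for display.

```python
database = [
    [["a", "b"], ["c"], ["a"]],   # S1
    [["a", "b"], ["b"], ["c"]],   # S2
    [["b"], ["c"], ["d"]],        # S3
    [["b"], ["a", "b"], ["c"]],   # S4
]


def pseudo_project(database, item):
    """Return one pointer per sequence that contains `item` (first occurrence)."""
    pointers = []
    for sid, sequence in enumerate(database):
        for i, itemset in enumerate(sequence):
            if item in itemset:
                pointers.append((sid, i, itemset.index(item)))
                break
    return pointers


def view(database, pointer):
    """Materialise the suffix designated by a pointer (for display only)."""
    sid, i, k = pointer
    sequence = database[sid]
    leftover = sequence[i][k + 1:]  # rest of the itemset, written {_...} on the slides
    return ([leftover] if leftover else []) + sequence[i + 1:]


pointers = pseudo_project(database, "a")
print(pointers)  # [(0, 0, 0), (1, 0, 0), (3, 1, 0)]
for p in pointers:
    print(view(database, p))
# [['b'], ['c'], ['a']]
# [['b'], ['b'], ['c']]
# [['b'], ['c']]
```

Only three small tuples are stored for the projected database of ⟨{𝑎}⟩; the original sequences are never copied.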
Optimization 2
• Observation:
– After reading the database to count the support of each item, PrefixSpan can remove all infrequent items from the database.
– This will reduce the database size…
– This could also be done when creating projected databases.

Sequence database (before)
𝑆1 = ⟨{𝑎, 𝑏}, {𝑐}, {𝑎}⟩
𝑆2 = ⟨{𝑎, 𝑏}, {𝑏}, {𝑐}⟩
𝑆3 = ⟨{𝑏}, {𝑐}, {𝑑}⟩
𝑆4 = ⟨{𝑏}, {𝑎, 𝑏}, {𝑐}⟩

Sequence database (after removing the infrequent item 𝑑)
𝑆1 = ⟨{𝑎, 𝑏}, {𝑐}, {𝑎}⟩
𝑆2 = ⟨{𝑎, 𝑏}, {𝑏}, {𝑐}⟩
𝑆3 = ⟨{𝑏}, {𝑐}⟩
𝑆4 = ⟨{𝑏}, {𝑎, 𝑏}, {𝑐}⟩

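Removing infrequent items is a simple preprocessing pass. The sketch below shows one way to write it in Python, assuming sequences are lists of sets; the function name `remove_infrequent_items` is hypothetical. It counts each item once per sequence, keeps the items whose support reaches the threshold, and drops any itemset that becomes empty.

```python
from collections import Counter


def remove_infrequent_items(database, minsup):
    """Return a copy of the database without items whose support is below minsup."""
    counts = Counter()
    for sequence in database:
        counts.update(set().union(*sequence))  # each item counted once per sequence
    frequent = {item for item, count in counts.items() if count >= minsup}
    cleaned = []
    for sequence in database:
        reduced = [itemset & frequent for itemset in sequence]
        cleaned.append([itemset for itemset in reduced if itemset])
    return cleaned


database = [
    [{"a", "b"}, {"c"}, {"a"}],   # S1
    [{"a", "b"}, {"b"}, {"c"}],   # S2
    [{"b"}, {"c"}, {"d"}],        # S3
    [{"b"}, {"a", "b"}, {"c"}],   # S4
]

cleaned = remove_infrequent_items(database, 3)
print(cleaned[2] == [{"b"}, {"c"}])  # True: d was removed from S3
```

By the anti-monotonicity property, an infrequent item cannot be part of any frequent pattern, so this pass does not change the result.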
Is PrefixSpan a good algorithm?

• Generally, very fast.
• For each frequent pattern, PrefixSpan scans the corresponding projected database once to count the support of patterns. This takes linear time w.r.t. the database size.
• Creating a projected database is done in linear time.
– This can still consume a lot of time and memory.
– But projected databases are always smaller than the original database.
• Unlike some other algorithms (e.g. GSP), PrefixSpan only considers patterns that exist in the database.
• PrefixSpan can be easily extended to add constraints (e.g. maximum length, maximum gap).

What influences the performance of PrefixSpan?

• The minsup threshold
• The database:
– The number of sequences
– The length of the sequences
– How similar the sequences are
– The number of distinct items

Code, datasets and more…
• A fast Java implementation of PrefixSpan is available in the SPMF data mining software (http://www.philippe-fournier-viger.com/spmf/).
– It can be used as standalone software or as a library.
– Several other sequential pattern mining algorithms are also provided.
– Datasets are provided.
• A survey of sequential pattern mining:
– Fournier-Viger, P., Lin, J. C.-W., Kiran, R. U., Koh, Y. S., Thomas, R. (2017). A Survey of Sequential Pattern Mining. Data Science and Pattern Recognition (DSPR), vol. 1(1), pp. 54-77.