Sequential Pattern Mining
Philippe Fournier-Viger
http://www.philippe-Fournier-viger.com
Fournier-Viger, P., Lin, J. C.-W., Kiran, R. U., Koh, Y. S., Thomas, R. (2017).
A Survey of Sequential Pattern Mining. Data Science and Pattern Recognition
(DSPR), vol. 1(1), pp. 54-77.
What is a discrete sequence?
A sequence is an ordered list of symbols.
I go back home
a b f g
a b c d
e f g h
Sequential Pattern Mining
• It is a popular data mining task, introduced in 1995
by Agrawal & Srikant.
• The goal is to find all subsequences that appear
frequently in a set of discrete sequences.
• For example:
– find sequences of items purchased by many customers
over time,
– find sequences of locations frequently visited by
tourists in a city,
– find sequences of words that appear frequently in a
text.
Definition: Items
Let there be a set of items (symbols) called $I$.
Example: $I = \{a, b, c, d, e\}$, where
a = apple    d = dates
b = bread    e = eggs
c = cake
Definition: Itemset
An itemset is a set of items $X$ that is a subset of $I$, that is, $X \subseteq I$.
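Definition: Subsequence ($\sqsubseteq$)
Let there be two sequences:
$s_a = \langle A_1, A_2, \ldots, A_n \rangle$ and $s_b = \langle B_1, B_2, \ldots, B_m \rangle$.
The sequence $s_a$ is a subsequence of $s_b$ if and only if
there exist integers $1 \leq i_1 < i_2 < \ldots < i_n \leq m$ such that $A_1 \subseteq B_{i_1}$, $A_2 \subseteq B_{i_2}$, …, $A_n \subseteq B_{i_n}$.
This is denoted as $s_a \sqsubseteq s_b$.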
Examples:
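The containment test in this definition can be sketched in Python. This is a minimal illustration, assuming a sequence is a list of itemsets and each itemset is a Python set; the function name `is_subsequence` and the sample sequences are hypothetical and do not come from the slides.

```python
def is_subsequence(sa, sb):
    """Return True if sa is a subsequence of sb.

    Each itemset of sa must be a subset of a distinct itemset of sb,
    and the matched itemsets of sb must appear in the same order.
    """
    j = 0  # position in sb from which the next match may start
    for a in sa:
        # Greedily take the earliest itemset of sb that contains a.
        while j < len(sb) and not set(a) <= set(sb[j]):
            j += 1
        if j == len(sb):
            return False  # no remaining itemset of sb contains a
        j += 1  # the next itemset of sa must match strictly later
    return True


# Hypothetical sequence over the items a, b, c, e.
sb = [{"a", "b"}, {"c"}, {"e"}]

print(is_subsequence([{"a"}, {"c"}], sb))   # True: {a} ⊆ {a,b}, then {c} ⊆ {c}
print(is_subsequence([{"c"}, {"a"}], sb))   # False: the order is violated
print(is_subsequence([{"a", "c"}], sb))     # False: no single itemset holds both a and c
```

Matching each itemset at the earliest possible position is sufficient here, because an earlier match never leaves fewer options for the itemsets that follow.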
Definition: Sequence database
A sequence database $SDB$ is a set of discrete
sequences where each sequence has a unique
sequence identifier (SID).
Definition: Support of a sequence
The number of sequences in a sequence
database $SDB$ that contain a sequence $s_a$ is called the
support of $s_a$. It is defined as:
$sup(s_a) = |\{ s \mid s \in SDB \wedge s_a \sqsubseteq s \}|$
Example 1: [figure: sequence database] support = 3
Example 2: [figure: sequence database] support = 4
Example 3: [figure: sequence database; support value not legible]
Example 4: [figure: sequence database] support = 2
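Counting support can be sketched as follows. This is a minimal illustration, assuming the database is a dictionary from sequence identifier to a list of itemsets; the database `SDB` below is a hypothetical toy example and is not the one shown on the slides.

```python
def is_subsequence(sa, sb):
    """Return True if every itemset of sa is contained, in order, in sb."""
    j = 0
    for a in sa:
        while j < len(sb) and not set(a) <= set(sb[j]):
            j += 1
        if j == len(sb):
            return False
        j += 1
    return True


def support(pattern, database):
    """Number of sequences of the database that contain the pattern."""
    return sum(1 for seq in database.values() if is_subsequence(pattern, seq))


# Hypothetical toy database (SID -> sequence of itemsets).
SDB = {
    1: [{"a", "b"}, {"c"}, {"e"}],
    2: [{"a"}, {"b"}, {"c"}],
    3: [{"b"}, {"c", "d"}],
    4: [{"a", "d"}, {"e"}],
}

print(support([{"b"}, {"c"}], SDB))   # 3 (sequences 1, 2 and 3)
print(support([{"a"}, {"c"}], SDB))   # 2 (sequences 1 and 2)
print(support([{"a", "b"}], SDB))     # 1 (sequence 1 only)
```

A sequence counts at most once toward the support, no matter how many times the pattern occurs inside it.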
Definition: Sequential pattern mining
• Input: A sequence database $SDB$ and a minimum
support threshold $minsup$.
• Output: All sequential patterns.
A sequential pattern is a sequence $s$ where $sup(s) \geq minsup$.
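The output condition can be sketched as a filter over candidate sequences. This is only an illustration of the definition, assuming the candidates are supplied by hand; how a real algorithm generates candidates is left out. The names `sequential_patterns`, `SDB` and `candidates` are hypothetical, and the toy database is not the one from the slides.

```python
def is_subsequence(sa, sb):
    """Return True if every itemset of sa is contained, in order, in sb."""
    j = 0
    for a in sa:
        while j < len(sb) and not set(a) <= set(sb[j]):
            j += 1
        if j == len(sb):
            return False
        j += 1
    return True


def support(pattern, database):
    """Number of sequences of the database that contain the pattern."""
    return sum(1 for seq in database.values() if is_subsequence(pattern, seq))


def sequential_patterns(candidates, database, minsup):
    """Keep the candidates whose support is at least minsup."""
    return [p for p in candidates if support(p, database) >= minsup]


# Hypothetical toy database (SID -> sequence of itemsets).
SDB = {
    1: [{"a", "b"}, {"c"}, {"e"}],
    2: [{"a"}, {"b"}, {"c"}],
    3: [{"b"}, {"c", "d"}],
    4: [{"a", "d"}, {"e"}],
}

candidates = [[{"a"}], [{"b"}, {"c"}], [{"a"}, {"c"}], [{"a", "b"}]]
print(sequential_patterns(candidates, SDB, minsup=3))
# [[{'a'}], [{'b'}, {'c'}]]
```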
Example 1
INPUT: [figure: sequence database], $minsup = 3$
OUTPUT: [figure: sequential patterns]
Example 2
INPUT: [figure: sequence database], $minsup = 4$
OUTPUT: [figure: sequential patterns]
• An efficient algorithm must find all frequent sequential
patterns without checking every possible subsequence.
Some popular algorithms
• GSP: R. Agrawal and R. Srikant, Mining sequential patterns, ICDE 1995, pp. 3–14.
• SPAM: J. Ayres, J. Flannick, J. Gehrke, and T. Yiu, Sequential pattern mining using a
bitmap representation, KDD 2002, pp. 429–435.
• SPADE: M. J. Zaki, SPADE: An efficient algorithm for mining frequent sequences,
Machine Learning, vol. 42(1-2), pp. 31–60, 2001.
• PrefixSpan: J. Pei, et al., Mining sequential patterns by pattern-growth: The
PrefixSpan approach, IEEE Transactions on Knowledge and Data Engineering, vol.
16(11), pp. 1424–1440, 2004.
• CM-SPAM and CM-SPADE: P. Fournier-Viger, A. Gomariz, M. Campos, and R.
Thomas, Fast Vertical Mining of Sequential Patterns Using Co-occurrence
Information, PAKDD 2014, pp. 40–52.
[Figure panels: Kosarak, BMS, Leviathan, Snake]
The “Apriori” property
Property (anti-monotonicity).
Let there be two sequences X and Y. If $X \sqsubseteq Y$ (X is a subsequence of Y), then the
support of Y is less than or equal to the support of X.
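Example (sequence database and the three sequences shown in a figure):
The support of the first sequence is 4
The support of the second sequence is 4
The support of the third sequence is 1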
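The property lets a search stop extending a sequence as soon as it is infrequent, since every longer sequence containing it is infrequent as well. Below is a minimal sketch of a depth-first search that uses this pruning. It is a generic illustration and not an implementation of GSP, SPAM, SPADE or PrefixSpan; support is recomputed by scanning the whole database, which the real algorithms avoid. The function `mine` and the toy database `SDB` are hypothetical, and patterns are represented as tuples of sorted item tuples.

```python
def is_subsequence(pattern, sequence):
    """Return True if every itemset of pattern is contained, in order, in sequence."""
    j = 0
    for itemset in pattern:
        while j < len(sequence) and not set(itemset) <= sequence[j]:
            j += 1
        if j == len(sequence):
            return False
        j += 1
    return True


def support(pattern, database):
    """Number of sequences of the database that contain the pattern."""
    return sum(1 for seq in database.values() if is_subsequence(pattern, seq))


def mine(database, minsup):
    """Return a dict mapping each sequential pattern to its support."""
    if minsup < 1:
        raise ValueError("minsup must be at least 1")
    items = sorted({x for seq in database.values() for s in seq for x in s})
    frequent = {}

    def grow(prefix):
        for x in items:
            # Append x as a new itemset at the end of the prefix.
            candidates = [prefix + ((x,),)]
            # Add x to the last itemset; requiring x to be larger than the
            # last item generates each pattern exactly once.
            if prefix and x > prefix[-1][-1]:
                candidates.append(prefix[:-1] + (prefix[-1] + (x,),))
            for cand in candidates:
                sup = support(cand, database)
                if sup >= minsup:
                    frequent[cand] = sup
                    grow(cand)
                # Otherwise prune: by anti-monotonicity, no extension of
                # cand can reach minsup, so the branch is not explored.

    grow(())
    return frequent


# Hypothetical toy database (SID -> sequence of itemsets).
SDB = {
    1: [{"a", "b"}, {"c"}, {"e"}],
    2: [{"a"}, {"b"}, {"c"}],
    3: [{"b"}, {"c", "d"}],
    4: [{"a", "d"}, {"e"}],
}

for pattern, sup in mine(SDB, minsup=3).items():
    print(pattern, sup)
# (('a',),) 3
# (('b',),) 3
# (('b',), ('c',)) 3
# (('c',),) 3
```

With `minsup=3`, the items d and e are discarded immediately (support 2), so no sequence containing them is ever counted.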