Sequential Pattern Mining

This document provides an introduction to Sequential Pattern Mining, focusing on the analysis of discrete sequences to identify frequent patterns. It defines key concepts such as discrete sequences, itemsets, sequence databases, and the support of sequences, and discusses the challenges of efficiently mining these patterns. Additionally, it highlights popular algorithms used for sequential pattern mining and their performance considerations.


An Introduction to

Sequential Pattern Mining

Philippe Fournier-Viger
http://www.philippe-Fournier-viger.com

Fournier-Viger, P., Lin, J. C.-W., Kiran, R. U., Koh, Y. S., Thomas, R. (2017).
A Survey of Sequential Pattern Mining. Data Science and Pattern Recognition
(DSPR), vol. 1(1), pp. 54-77.

Source code and datasets available in the SPMF library


Introduction
• Data Mining: the goal is to discover or extract
useful knowledge from data.
• Many types of data can be analyzed: graphs,
relational databases, time series, sequences,
etc.
• In this presentation, we focus on analyzing a
common type of data called discrete
sequences to find interesting patterns in them.
What is a discrete sequence?
A sequence is an ordered list of symbols.

Example 1: a sequence can be the items that are purchased by a customer over time:

Computer, Monitor, Router
Example 2: a sequence can be the list of words in a sentence:

I go back home
Example 3: a sequence can be the list of locations visited by a car in a city:

a, b, f, g

[Figure: city map with locations a, b, c, d, e, f, g, h]
Sequential Pattern Mining
• It is a popular data mining task, introduced in 1994
by Agrawal & Srikant.
• The goal is to find all subsequences that appear
frequently in a set of discrete sequences.
• For example:
– find sequences of items purchased by many customers
over time,
– find sequences of locations frequently visited by
tourists in a city,
– find sequences of words that appear frequently in a
text.
Definition: Items
Let there be a set of items (symbols) called $I$.

Example: $I = \{a, b, c, d, e\}$, where

a = apple    d = dates
b = bread    e = eggs
c = cake
Definition: Itemset
An itemset is a set of items that is a subset of $I$.

Example: $\{a, b, c\}$ is an itemset containing 3 items

$\{d, e\}$ is an itemset containing 2 items

• An itemset having $k$ items is called a k-itemset.

• Note: an itemset cannot contain the same item twice.
Definition: Sequence
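A discrete sequence $S$ is an ordered list of itemsets
$S = \langle X_1, X_2, \ldots, X_n \rangle$ where $X_j \subseteq I$ for any $j \in \{1, 2, \ldots, n\}$.

Example 1: $\langle \{a, b\}, \{c\} \rangle$ is a sequence containing two itemsets.

It means that a customer purchased $a$ and $b$ at the same time
and then purchased $c$.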

Example 2:

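Definition: Subsequence ($\sqsubseteq$)
Let there be two sequences:
$S_a = \langle A_1, A_2, \ldots, A_n \rangle$ and $S_b = \langle B_1, B_2, \ldots, B_m \rangle$.
The sequence $S_a$ is a subsequence of $S_b$ if and only if
there exist integers $1 \leq i_1 < i_2 < \ldots < i_n \leq m$ such that $A_1 \subseteq B_{i_1}$, $A_2 \subseteq B_{i_2}$, …, $A_n \subseteq B_{i_n}$.

This is denoted as $S_a \sqsubseteq S_b$.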
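
As a rough illustration of the subsequence definition, the check can be sketched in Python. This is only a sketch under assumptions: each itemset is stored as a `frozenset`, a sequence is a list of them, and the function name `is_subsequence` and the sample sequence are hypothetical (they do not come from the slides). Matching each itemset at the earliest possible position is enough, because an earlier match never rules out a later one.

```python
def is_subsequence(sa, sb):
    """Return True if sa is a subsequence of sb.

    Both arguments are lists of frozensets (one frozenset per itemset).
    """
    pos = 0  # next position of sb that is still available for matching
    for itemset in sa:
        # Move forward to the first remaining itemset of sb that contains it.
        while pos < len(sb) and not itemset <= sb[pos]:
            pos += 1
        if pos == len(sb):
            return False  # sb has no itemset left to match
        pos += 1  # the next itemset of sa must match strictly later
    return True


# Hypothetical sequence, not taken from the slides.
s = [frozenset("ab"), frozenset("c"), frozenset("fg"), frozenset("g"), frozenset("e")]
print(is_subsequence([frozenset("b"), frozenset("fg")], s))  # True
print(is_subsequence([frozenset("c"), frozenset("a")], s))   # False: a occurs only before c
```

Note that the order of itemsets matters, while the order of items inside one itemset does not.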

Examples:

Definition: Sequence database
A sequence database $SDB$ is a set of discrete
sequences $SDB = \{S_1, S_2, \ldots, S_p\}$ where each sequence has a unique
identifier (SID).

Example 1: This is a sequence database with four sequences:

[Table: Sequence database]
Definition: Support of a sequence
The number of sequences in a sequence
database $SDB$ that contain a sequence $S$ is called the
support of $S$. It is defined as: $sup(S) = |\{S_i \in SDB \mid S \sqsubseteq S_i\}|$
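
The support count can be sketched directly from this definition. The code below is an illustrative sketch, not the author's implementation: the names `seq`, `is_subsequence` and `support` are hypothetical, and the four-sequence database `DB` is made up for the example (it is not the table shown on the slides). Items are single characters so that `frozenset("ab")` yields the itemset {a, b}.

```python
def seq(*itemsets):
    """Build a sequence (list of frozensets) from strings such as "ab"."""
    return [frozenset(x) for x in itemsets]


def is_subsequence(sa, sb):
    """Return True if every itemset of sa fits, in order, into sb."""
    pos = 0
    for itemset in sa:
        while pos < len(sb) and not itemset <= sb[pos]:
            pos += 1
        if pos == len(sb):
            return False
        pos += 1
    return True


def support(pattern, database):
    """Number of sequences of the database that contain the pattern."""
    return sum(1 for s in database.values() if is_subsequence(pattern, s))


# Hypothetical database: SID -> sequence (assumed data, for illustration only).
DB = {
    1: seq("ab", "c", "fg", "g", "e"),
    2: seq("ad", "c", "b", "abef"),
    3: seq("a", "b", "fg", "e"),
    4: seq("b", "fg"),
}

print(support(seq("b"), DB))        # 4
print(support(seq("b", "fg"), DB))  # 3
print(support(seq("a", "fg"), DB))  # 2
```

A sequence is counted at most once, no matter how many times the pattern occurs inside it.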

Example 1:
[Figure: sequence database; the pattern shown has $sup = 3$]
Example 2:
[Figure: sequence database; the pattern shown has $sup = 4$]
Example 3:
[Figure: sequence database; the support value is not legible]
Example 4:
[Figure: sequence database; the pattern shown has $sup = 2$]
Definition: Sequential pattern mining
• Input: A sequence database $SDB$ and a minimum
support threshold $minsup$.
• Output: All sequential patterns.
A sequential pattern is a sequence $S$ where $sup(S) \geq minsup$.
Example 1
INPUT: a sequence database and $minsup = 3$
OUTPUT: all sequential patterns

[Figure: the sequence database and the list of patterns; the supports listed are 3, 4, 4, 3, 2, 4, and 3]

What will happen if we change the threshold?
Example 2
INPUT: a sequence database and $minsup = 4$
OUTPUT: all sequential patterns

[Figure: the sequence database and the three resulting patterns, each with support 4]

Observation: If we increase the minsup threshold, fewer patterns may be found.
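
The effect of the threshold can be checked with a small brute-force sketch that turns the problem definition directly into code. Several things here are assumptions made for illustration: the database `DB` is hypothetical (not the one pictured on the slides), the function names are made up, candidates are restricted to patterns whose itemsets each hold a single item, and `max_len` caps the pattern length so the enumeration stays finite.

```python
from itertools import product


def is_subsequence(sa, sb):
    """Return True if every itemset of sa fits, in order, into sb."""
    pos = 0
    for itemset in sa:
        while pos < len(sb) and not itemset <= sb[pos]:
            pos += 1
        if pos == len(sb):
            return False
        pos += 1
    return True


def support(pattern, database):
    return sum(1 for s in database if is_subsequence(pattern, s))


def naive_mine(database, minsup, max_len):
    """Enumerate every candidate up to max_len and keep the frequent ones."""
    items = sorted({i for s in database for itemset in s for i in itemset})
    patterns = {}
    for length in range(1, max_len + 1):
        for candidate in product(items, repeat=length):
            sup = support([frozenset([i]) for i in candidate], database)
            if sup >= minsup:
                patterns[candidate] = sup
    return patterns


# Hypothetical database (assumed data, for illustration only).
DB = [
    [frozenset(x) for x in ("ab", "c", "fg", "g", "e")],
    [frozenset(x) for x in ("ad", "c", "b", "abef")],
    [frozenset(x) for x in ("a", "b", "fg", "e")],
    [frozenset(x) for x in ("b", "fg")],
]

at_3 = naive_mine(DB, minsup=3, max_len=3)
at_4 = naive_mine(DB, minsup=4, max_len=3)
print(len(at_3), len(at_4))     # 10 3
print(set(at_4) <= set(at_3))   # True: raising minsup can only remove patterns
```

Every pattern found at the higher threshold is also found at the lower one, which is why raising minsup never adds patterns.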
It is a difficult problem!
• A naïve algorithm would read the database and count the
support (frequency) of all possible patterns.
• This is inefficient because the number of possible
sequential patterns can be very large.
• An efficient algorithm must find the frequent sequential
patterns, without checking all possibilities.
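
To give a sense of scale: with $n$ items there are $2^n - 1$ non-empty itemsets, and a candidate sequence of $k$ itemsets can use any of them at each position. The short sketch below only does this arithmetic; the function name `count_candidates` is hypothetical, and the bound on the number of itemsets per sequence is an assumption added to keep the count finite.

```python
def count_candidates(num_items, max_itemsets):
    """Count the sequences made of 1..max_itemsets non-empty itemsets."""
    nonempty_itemsets = 2 ** num_items - 1
    return sum(nonempty_itemsets ** k for k in range(1, max_itemsets + 1))


# With only the five items a..e used in the definitions above:
print(count_candidates(5, 1))  # 31
print(count_candidates(5, 3))  # 30783
print(count_candidates(5, 6))
```

Even for five items the count grows by a factor of about 31 with each additional itemset, so counting the support of every candidate quickly becomes impractical.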
Some popular algorithms
• GSP: R. Agrawal and R. Srikant, Mining sequential patterns, ICDE 1995, pp. 3–14, 1995.
• SPAM: J. Ayres, J. Flannick, J. Gehrke, and T. Yiu, Sequential pattern mining using a bitmap representation, KDD 2002, pp. 429–435, 2002.
• SPADE: M. J. Zaki, SPADE: An efficient algorithm for mining frequent sequences, Machine Learning, vol. 42(1-2), pp. 31–60, 2001.
• PrefixSpan: J. Pei, et al., Mining sequential patterns by pattern-growth: The PrefixSpan approach, IEEE Transactions on Knowledge and Data Engineering, vol. 16(11), pp. 1424–1440, 2004.
• CM-SPAM and CM-SPADE: P. Fournier-Viger, A. Gomariz, M. Campos, and R. Thomas, Fast Vertical Mining of Sequential Patterns Using Co-occurrence Information, PAKDD 2014, pp. 40–52, 2014.

They all have the same input and output.


The difference is performance due to optimizations, search strategies and data structures!

Fast implementations available in the SPMF library


A performance comparison
Four benchmark datasets are used: Kosarak, BMS, Leviathan, and Snake.

[Figure: performance plots, one panel per dataset]
The “Apriori” property
Property (anti-monotonicity).
Let X and Y be two sequences. If $X \sqsubseteq Y$, then the
support of Y is less than or equal to the support of X.

Example
[Figure: sequence database; the three patterns shown have supports 4, 4, and 1]
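
The property is what allows candidates to be skipped: if a pattern is infrequent, none of its extensions can be frequent, so they never need to be counted. The level-wise sketch below illustrates only this pruning idea and is not one of the published algorithms. It assumes patterns whose itemsets each hold a single item, a hypothetical database `DB` that is not the one on the slides, and made-up function names.

```python
def is_subsequence(sa, sb):
    """Return True if every itemset of sa fits, in order, into sb."""
    pos = 0
    for itemset in sa:
        while pos < len(sb) and not itemset <= sb[pos]:
            pos += 1
        if pos == len(sb):
            return False
        pos += 1
    return True


def support(pattern, database):
    return sum(1 for s in database if is_subsequence(pattern, s))


def mine_with_pruning(database, minsup):
    """Extend only frequent patterns; return (patterns, candidates counted)."""
    if minsup < 1:
        raise ValueError("minsup must be at least 1, otherwise the search never ends")
    items = sorted({i for s in database for itemset in s for i in itemset})
    frequent, evaluated = {}, 0
    frontier = [()]  # start from the empty pattern
    while frontier:
        next_frontier = []
        for prefix in frontier:
            for item in items:
                candidate = prefix + (item,)
                evaluated += 1
                sup = support([frozenset([i]) for i in candidate], database)
                if sup >= minsup:  # infrequent candidates are never extended
                    frequent[candidate] = sup
                    next_frontier.append(candidate)
        frontier = next_frontier
    return frequent, evaluated


# Hypothetical database (assumed data, for illustration only).
DB = [
    [frozenset(x) for x in ("ab", "c", "fg", "g", "e")],
    [frozenset(x) for x in ("ad", "c", "b", "abef")],
    [frozenset(x) for x in ("a", "b", "fg", "e")],
    [frozenset(x) for x in ("b", "fg")],
]

frequent, evaluated = mine_with_pruning(DB, minsup=3)
print(len(frequent), evaluated)  # 10 77
```

On this toy database the pruned search counts 77 candidates, whereas enumerating every single-item pattern of length 1 to 3 over the seven items would count 7 + 49 + 343 = 399.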
