Sequential Pattern Mining

Sequential pattern mining is a type of data mining concerned with finding statistically relevant patterns between data examples delivered in a sequence. It involves mining frequently occurring subsequences as patterns from transaction databases where items are ordered. The main algorithms for sequential pattern mining include GSP, which makes multiple database passes to mine frequent sequences of increasing length, SPADE, FreeSpan, PrefixSpan, and others. Sequential pattern mining has applications in market basket analysis, DNA sequence analysis, and other domains involving sequential data.
What is sequential pattern mining?


Sequential pattern mining is a topic of data mining
concerned with finding statistically relevant patterns
between data examples where the values are delivered in a
sequence.

It is the mining of frequently appearing series of events,
or subsequences, as patterns.
There are several key traditional computational problems
addressed within this field.

These include building efficient databases and indexes for
sequence information, extracting the frequently occurring
patterns, comparing sequences for similarity, and recovering
missing sequence members.
In general, sequence mining problems can be classified as
string mining, which is typically based on string processing
algorithms, and itemset mining, which is typically based on
association rule learning.
String Mining
String mining typically deals with a limited alphabet for
items that appear in a sequence, but the sequence itself may
be very long.

An example of an alphabet is the ASCII character set used in
natural language text.
Algorithms used for String Mining
Repeat-related problems deal with operations on single
sequences and can be based on exact string matching or
approximate string matching methods for finding dispersed
fixed-length and maximal-length repeats, finding tandem
repeats, and finding unique subsequences and missing
(un-spelled) subsequences.
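
As a rough illustration of one repeat-related problem, the sketch below finds tandem repeats (a unit repeated back to back) using exact string matching only. The function name `find_tandem_repeats`, its parameters, and the sample string are assumptions made for this example; it is a brute-force sketch, not one of the specialised algorithms used in practice.

```python
def find_tandem_repeats(text, min_copies=2):
    """Return (start, unit, copies) for runs of a unit repeated back to back.

    Brute force, exact matching only: every start position and every
    unit length is tried.
    """
    n = len(text)
    max_unit = n // min_copies  # a longer unit cannot repeat often enough
    found = []
    for start in range(n):
        for unit_len in range(1, max_unit + 1):
            unit = text[start:start + unit_len]
            if len(unit) < unit_len:
                break  # ran off the end of the text
            # Skip positions that merely continue a run reported earlier.
            if start >= unit_len and text[start - unit_len:start] == unit:
                continue
            copies = 1
            pos = start + unit_len
            while text[pos:pos + unit_len] == unit:
                copies += 1
                pos += unit_len
            if copies >= min_copies:
                found.append((start, unit, copies))
    return found


repeats = find_tandem_repeats("GATTACACACAG")
```

Here `repeats` includes the unit "AC" repeated three times starting at index 4. Overlapping descriptions of the same region (for example "CA" starting one position later) are reported separately in this simple version.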
Alignment problems deal with comparison between strings by
first aligning one or more sequences; examples of popular
methods include BLAST for comparing a single sequence with
multiple sequences in a database, and ClustalW for multiple
alignments.

Alignment algorithms can be based on either exact or
approximate methods, and can also be classified as global
alignments, semi-global alignments and local alignments.
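
To make the global versus local distinction concrete, here is a minimal dynamic-programming sketch that scores a pairwise alignment. The scoring values (match +2, mismatch -1, gap -2) and the name `alignment_score` are assumptions chosen for illustration; this is the textbook dynamic-programming recurrence with a linear gap penalty, not the heuristic used by BLAST or ClustalW.

```python
def alignment_score(a, b, match=2, mismatch=-1, gap=-2, local=True):
    """Score the best pairwise alignment of strings a and b.

    local=True  : best-scoring pair of substrings (cells never drop below 0).
    local=False : both strings must be aligned end to end (global).
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # In a global alignment, leading gaps are charged.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # gap in b
            left = score[i][j - 1] + gap    # gap in a
            cell = max(diag, up, left)
            if local:
                cell = max(cell, 0)         # a local alignment may restart anywhere
            score[i][j] = cell
            best = max(best, cell)
    return best if local else score[-1][-1]


local_score = alignment_score("xxGATTACAyy", "GATTACA", local=True)
global_score = alignment_score("xxGATTACAyy", "GATTACA", local=False)
```

With these assumed scores, the local alignment ignores the flanking characters and scores 14, while the global alignment must pay for four gaps and scores 6.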
Itemset mining
Itemset mining is used in marketing applications for
discovering regularities between frequently co-occurring
items in large transactions.

For example, by analysing transactions of customer shopping
baskets in a supermarket, one can produce a rule which reads
"if a customer buys onions and potatoes together, he or she
is likely to also buy hamburger meat in the same
transaction".
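
The strength of such a rule is usually measured by its support and confidence. The sketch below computes both for the onions-and-potatoes rule; the five baskets are invented for the example, and the function names are illustrative.

```python
# Hypothetical shopping baskets, invented for this example.
transactions = [
    {"onions", "potatoes", "hamburger meat"},
    {"onions", "potatoes", "hamburger meat", "beer"},
    {"onions", "potatoes"},
    {"milk", "bread"},
    {"potatoes", "hamburger meat"},
]


def count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)


def support(itemset, transactions):
    """Fraction of transactions that contain the itemset."""
    return count(itemset, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent."""
    return count(antecedent | consequent, transactions) / count(antecedent, transactions)


rule_support = support({"onions", "potatoes", "hamburger meat"}, transactions)
rule_confidence = confidence({"onions", "potatoes"}, {"hamburger meat"}, transactions)
```

In this toy data, two of the five baskets contain all three items (support 0.4), and two of the three baskets with onions and potatoes also contain hamburger meat (confidence 2/3).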
Applications of Sequential Pattern Mining
With a wide variety of products and customer buying
behaviors, the shelf on which products are displayed is one
of the most important resources in a retail environment. By
managing shelf space allocation and product display
properly, retailers can both increase profit and decrease
cost. To address this problem, George and Binu (2013)
proposed an approach that mines customer buying patterns
with the PrefixSpan algorithm and places products on shelves
in the order of the mined purchasing patterns.
Concept behind Sequential Pattern Mining
Given a set of sequences, where each sequence consists of a
list of events (or elements) and each event consists of a
set of items, and given a user-specified minimum support
threshold min_sup, sequential pattern mining discovers all
frequent subsequences, i.e., the subsequences whose
occurrence frequency in the set of sequences is no less
than min_sup.
Let I = {I1, I2, ..., Ip} be the set of all items. An itemset
is a nonempty set of items. A sequence is an ordered list
of events. A sequence s is denoted ⟨e1 e2 e3 … el⟩, where
event e1 occurs before e2, which occurs before e3, and so
on. Event ej is also called an element of s.
In the case of customer purchase data, an event is a
shopping trip in which a customer purchases items at a
specific store. The event is an itemset, i.e., an unordered
list of items that the customer purchased during the trip.
The itemset (or event) is denoted (x1x2···xq), where xk is
an item.
An item can appear at most once in an event of a sequence,
but can appear several times in different events of a
sequence. The number of instances of items in a sequence is
called the length of the sequence. A sequence of length l is
called an l-sequence.
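
One possible in-memory representation, assumed here purely for illustration, is a Python list of sets: the list keeps the order of events and each set is an unordered itemset. The sample sequence ⟨a(abc)(ac)d(cf)⟩ is an invented example.

```python
# The sequence <a (abc) (ac) d (cf)>: five events, in order.
sequence = [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "f"}]


def sequence_length(sequence):
    """Length of a sequence = total number of item instances across events."""
    return sum(len(event) for event in sequence)


number_of_events = len(sequence)           # 5 events
length = sequence_length(sequence)         # 1 + 3 + 2 + 1 + 2 item instances
```

The item a occurs in three different events and is counted three times, so this sequence has five events but is a 9-sequence.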
A sequence database, S, is a set of tuples (SID, s), where
SID is a sequence ID and s is a sequence. For instance, S
contains the sequences for all customers of the store. A
tuple (SID, s) is said to contain a sequence α if α is a
subsequence of s. The support of α in S is the number of
tuples that contain α.
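
The containment test can be sketched as follows. A pattern α is taken to be a subsequence of s when its events can be matched, in order, to events of s that are supersets of them; the support of α is then the number of tuples whose sequence contains α. The four-sequence database and the function names are assumptions made for this example.

```python
# A small hypothetical sequence database of (SID, sequence) tuples.
database = [
    (1, [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "f"}]),
    (2, [{"a", "d"}, {"c"}, {"b", "c"}, {"a", "e"}]),
    (3, [{"e", "f"}, {"a", "b"}, {"d", "f"}, {"c"}, {"b"}]),
    (4, [{"e"}, {"g"}, {"a", "f"}, {"c"}, {"b"}, {"c"}]),
]


def contains(sequence, pattern):
    """True if pattern is a subsequence of sequence.

    Each pattern event must be a subset of some event of the sequence,
    and the matched events must appear in the same order.
    """
    events = iter(sequence)  # shared iterator, so matching only moves forward
    return all(any(p <= e for e in events) for p in pattern)


def support(database, pattern):
    """Number of tuples whose sequence contains the pattern."""
    return sum(1 for _sid, s in database if contains(s, pattern))


alpha = [{"a", "b"}, {"c"}]      # the pattern <(ab) c>
alpha_support = support(database, alpha)
```

Sequences 1 and 3 contain ⟨(ab)c⟩, so its support is 2 and it is frequent for min_sup = 2. Matching each pattern event to the earliest possible event of s is enough for a yes/no answer.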
This formulation of sequential pattern mining is an
abstraction of customer shopping sequence analysis. Scalable
techniques for sequential pattern mining on such data are
listed under Algorithms for Sequential Pattern Mining.

Not every application fits this formulation. In DNA sequence
analysis, approximate patterns become helpful because DNA
sequences can include (symbol) insertions, deletions, and
mutations. Such diverse requirements can be viewed as
constraint relaxation or enforcement.
Algorithms for Sequential Pattern Mining
● GSP algorithm
● Sequential Pattern Discovery using Equivalence classes
(SPADE)
● FreeSpan
● PrefixSpan
● MAPres
● Seq2Pat (for constraint-based sequential pattern mining)
Generalized Sequential Pattern algorithm (GSP)
The algorithms for solving sequence mining problems are
mostly based on the Apriori (level-wise) algorithm. One way
to use the level-wise paradigm is to first discover all the
frequent items in a level-wise fashion.

It simply means counting the occurrences of all singleton
elements in the database. Then, the transactions are
filtered by removing the non-frequent items. At the end of
this step, each transaction consists of only the frequent
elements it originally contained.
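
A minimal sketch of this first step is shown below. It assumes sequences are stored as lists of item sets and that an item is counted once per sequence (its support), which is one reading of "counting the occurrences"; the database and the function names are invented for the example.

```python
from collections import Counter

# Hypothetical database: each sequence is an ordered list of item sets.
database = [
    [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "f"}],
    [{"a", "d"}, {"c"}, {"b", "c"}, {"a", "e"}],
    [{"e", "f"}, {"a", "b"}, {"d", "f"}, {"c"}, {"b"}],
    [{"e"}, {"g"}, {"a", "f"}, {"c"}, {"b"}, {"c"}],
]


def frequent_items(database, min_sup):
    """Items that occur in at least min_sup sequences."""
    counts = Counter()
    for sequence in database:
        # Assumption: an item counts once per sequence, however often it recurs.
        counts.update(set().union(*sequence))
    return {item for item, n in counts.items() if n >= min_sup}


def filter_database(database, frequent):
    """Drop non-frequent items; events left empty are dropped too."""
    filtered = []
    for sequence in database:
        events = [event & frequent for event in sequence]
        filtered.append([event for event in events if event])
    return filtered


frequent = frequent_items(database, min_sup=2)
filtered = filter_database(database, frequent)
```

With min_sup = 2 only g is infrequent (it appears in one sequence), so the last sequence loses its g event and every other sequence is unchanged.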
The GSP algorithm makes multiple database passes. In the
first pass, all single items (1-sequences) are counted. From
the frequent items, a set of candidate 2-sequences is
formed, and another pass is made to identify their
frequency.

The frequent 2-sequences are used to generate the candidate
3-sequences, and this process is repeated until no more
frequent sequences are found.
Two Main Steps of GSP
Candidate Generation. Given F(k-1), the set of frequent
(k-1)-sequences, the candidates for the next pass are
generated by joining F(k-1) with itself. A pruning phase
eliminates any candidate that has at least one infrequent
subsequence.

Support Counting. Normally, a hash tree-based search is
employed for efficient support counting. Finally,
non-maximal frequent sequences are removed.
Pseudocode for GSP
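
What follows is a minimal Python sketch of the level-wise loop, candidate generation, and support counting described above, written under simplifying assumptions rather than as a full GSP implementation: every event is a single item (so a sequence is a tuple of items), support is counted by a plain scan instead of a hash tree, and the final removal of non-maximal sequences is left out. The names `gsp`, `support`, and `is_subsequence` are illustrative.

```python
from itertools import product


def is_subsequence(pattern, sequence):
    """True if the items of pattern occur in sequence in the same order."""
    remaining = iter(sequence)
    return all(item in remaining for item in pattern)


def support(pattern, database):
    """Number of sequences in the database that contain the pattern."""
    return sum(1 for sequence in database if is_subsequence(pattern, sequence))


def gsp(database, min_sup):
    """Return {frequent sequence: support}, mined level by level."""
    items = sorted({item for sequence in database for item in sequence})
    # Pass 1: count all 1-sequences.
    level = {(i,): support((i,), database) for i in items}
    level = {p: s for p, s in level.items() if s >= min_sup}
    frequent = {}
    while level:
        frequent.update(level)
        # Join step: s1 and s2 join when s1 minus its first item equals
        # s2 minus its last item; the candidate is s1 plus s2's last item.
        candidates = {
            s1 + s2[-1:]
            for s1, s2 in product(level, repeat=2)
            if s1[1:] == s2[:-1]
        }
        # Prune step: every (k-1)-subsequence must itself be frequent.
        candidates = {
            c for c in candidates
            if all(c[:i] + c[i + 1:] in level for i in range(len(c)))
        }
        # Another database pass to count the surviving candidates.
        counted = {c: support(c, database) for c in candidates}
        level = {c: s for c, s in counted.items() if s >= min_sup}
    return frequent


# Hypothetical database of single-item event sequences.
database = [("a", "b", "c"), ("a", "c"), ("a", "b", "c", "d"), ("b", "c")]
patterns = gsp(database, min_sup=2)
```

On this toy database with min_sup = 2, the loop stops after the third level, where ("a", "b", "c") is the only frequent 3-sequence. Handling events that contain several items requires a richer join step than the one sketched here.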