Sequential Pattern Mining

Sequential pattern mining is a type of data mining concerned with finding statistically relevant patterns between data examples delivered in a sequence. It involves mining frequently occurring subsequences as patterns from transaction databases where items are ordered. The main algorithms for sequential pattern mining include GSP, which makes multiple database passes to mine frequent sequences of increasing length, SPADE, FreeSpan, PrefixSpan, and others. Sequential pattern mining has applications in market basket analysis, DNA sequence analysis, and other domains involving sequential data.

Uploaded by

TuLbig E. Winnower

What is sequential pattern mining?

Sequential pattern mining is a topic of data mining
concerned with finding statistically relevant patterns
between data examples where the values are delivered in a
sequence.

It is the mining of frequently occurring series of events,
or subsequences, as patterns.

There are several key traditional computational problems
addressed within this field. These include building
efficient databases and indexes for sequence information,
extracting the frequently occurring patterns, comparing
sequences for similarity, and recovering missing sequence
members.

In general, sequence mining problems can be classified as
string mining, which is typically based on string processing
algorithms, and itemset mining, which is typically based on
association rule learning.
String Mining
String mining typically deals with a limited alphabet for
items that appear in a sequence, but the sequence itself may
be very long.

Examples of an alphabet include the ASCII character set used
in natural language text.
Algorithms used for String Mining
Repeat-related problems deal with operations on single
sequences. They can be based on exact or approximate string
matching methods for finding dispersed fixed-length and
maximal-length repeats, finding tandem repeats, and finding
unique subsequences and missing (un-spelled) subsequences.

Alignment problems deal with comparison between strings by
first aligning one or more sequences. Popular methods
include BLAST, for comparing a single sequence with multiple
sequences in a database, and ClustalW, for multiple
alignments.

Alignment algorithms can be based on either exact or
approximate methods, and can also be classified as global
alignments, semi-global alignments, and local alignments.
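As one illustration of the repeat-related problems above, the sketch below finds fixed-length repeats and tandem repeats using exact string matching only. The function names and the sample strings are hypothetical, chosen for this example, and approximate matching is not covered.

```python
from collections import defaultdict


def fixed_length_repeats(text, k):
    """Map each length-k substring that occurs more than once to its start positions."""
    positions = defaultdict(list)
    for i in range(len(text) - k + 1):
        positions[text[i:i + k]].append(i)
    return {sub: pos for sub, pos in positions.items() if len(pos) > 1}


def tandem_repeats(text, k):
    """Start positions where a length-k unit is immediately followed by a copy of itself."""
    return [
        i
        for i in range(len(text) - 2 * k + 1)
        if text[i:i + k] == text[i + k:i + 2 * k]
    ]


# Hypothetical sample strings over the alphabet {A, C, G, T}.
dispersed = fixed_length_repeats("ACGTTACGGACG", 3)
tandem = tandem_repeats("GATATATC", 2)
```

The dictionary-of-positions approach is the simplest choice for short inputs. For very long sequences, index structures such as suffix trees or suffix arrays are the usual alternative.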
Itemset mining
Itemset mining is used in marketing applications for
discovering regularities between frequently co-occurring
items in large transactions.

For example, by analysing transactions of customer shopping
baskets in a supermarket, one can produce a rule which reads
"if a customer buys onions and potatoes together, he or she
is likely to also buy hamburger meat in the same
transaction".
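How strongly the data backs such a rule is usually measured by its support (the fraction of transactions containing all items of the rule) and its confidence (the fraction of transactions containing the "if" part that also contain the "then" part). A minimal sketch follows. The baskets are made-up illustration data and `rule_stats` is a hypothetical helper name.

```python
def rule_stats(transactions, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    both = antecedent | consequent
    n_antecedent = sum(1 for t in transactions if antecedent.issubset(t))
    n_both = sum(1 for t in transactions if both.issubset(t))
    support = n_both / len(transactions)
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Made-up shopping baskets, for illustration only.
baskets = [
    {"onions", "potatoes", "hamburger meat"},
    {"onions", "potatoes", "hamburger meat", "beer"},
    {"onions", "potatoes"},
    {"milk", "bread"},
]

support, confidence = rule_stats(
    baskets, {"onions", "potatoes"}, {"hamburger meat"}
)
```

Here two of the four baskets contain all three items, and two of the three baskets with onions and potatoes also contain hamburger meat.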
Applications of Sequential Pattern Mining
With a great variety of products and customer buying
behaviors, the shelf on which products are displayed is one
of the most important resources in a retail environment.
Retailers can not only increase their profit but also
decrease costs through proper management of shelf space
allocation and product display. To address this problem,
George and Binu (2013) proposed an approach that mines
customer buying patterns using the PrefixSpan algorithm and
places the products on shelves based on the order of the
mined purchasing patterns.
Concept behind Sequential Pattern Mining
Given a set of sequences, where each sequence consists of a
list of events (or elements) and each event consists of a
set of items, and given a user-specified minimum support
threshold min_sup, sequential pattern mining discovers all
frequent subsequences, i.e., the subsequences whose
occurrence frequency in the set of sequences is no less
than min_sup.

Let I = {I1, I2, ..., Ip} be the set of all items. An
itemset is a nonempty set of items. A sequence is an ordered
list of events. A sequence s is denoted ⟨e1 e2 e3 … el⟩,
where event e1 occurs before e2, which occurs before e3, and
so on. Event ej is also called an element of s.

In the case of customer purchase data, an event corresponds
to a shopping trip in which a customer purchases items at a
specific store. The event is therefore an itemset, i.e., an
unordered list of items that the customer purchased during
the trip. The itemset (or event) is denoted (x1 x2 ··· xq),
where xk is an item.

An item can appear at most once in an event of a sequence,
but can appear several times in different events of a
sequence. The number of instances of items in a sequence is
called the length of the sequence. A sequence with length l
is called an l-sequence.

A sequence database, S, is a set of tuples (SID, s), where
SID is a sequence_ID and s is a sequence. For instance, S
contains the sequences for all customers of the store. A
tuple (SID, s) is said to contain a sequence α if α is a
subsequence of s. The support of α in S is the number of
tuples that contain α, and α is frequent if its support is
no less than min_sup.
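The definitions above can be made concrete with a short sketch. It assumes the usual meaning of "subsequence" in this setting: α is a subsequence of s if the events of α can be matched, in order, to events of s that contain them as subsets. The function names and the small database are hypothetical, chosen only for illustration.

```python
def contains(s, alpha):
    """True if alpha is a subsequence of s.

    Each event of alpha must be a subset of some event of s, and the
    matched events of s must appear in the same order as in alpha.
    """
    pos = 0
    for event in alpha:
        # Greedily take the earliest event of s that covers this event.
        while pos < len(s) and not set(event).issubset(s[pos]):
            pos += 1
        if pos == len(s):
            return False
        pos += 1
    return True


def length(s):
    """Number of item instances in the sequence (an l-sequence has length l)."""
    return sum(len(event) for event in s)


def support(database, alpha):
    """Number of (SID, s) tuples in the database that contain alpha."""
    return sum(1 for _sid, s in database if contains(s, alpha))


# Illustrative sequence database: a list of (SID, sequence) tuples,
# where each sequence is a list of events and each event is a set of items.
database = [
    (1, [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "f"}]),
    (2, [{"a", "d"}, {"c"}, {"b", "c"}, {"a", "e"}]),
    (3, [{"e", "f"}, {"a", "b"}, {"d", "f"}, {"c"}, {"b"}]),
    (4, [{"e"}, {"g"}, {"a", "f"}, {"c"}, {"b"}, {"c"}]),
]

alpha = [{"a", "b"}, {"c"}]
alpha_support = support(database, alpha)
first_length = length(database[0][1])
```

With min_sup = 2, the pattern ⟨(ab) c⟩ is frequent in this database because sequences 1 and 3 contain it. Sequence 1 has nine item instances, so it is a 9-sequence.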
Concept behind Sequential Pattern Mining
This formulation of sequential pattern mining is an
abstraction of customer shopping sequence analysis. Scalable
techniques for sequential pattern mining on such records are
listed under Algorithms for Sequential Pattern Mining.

Other applications may call for a different formulation. In
DNA sequence analysis, for example, approximate patterns
become helpful because DNA sequences can include (symbol)
insertions, deletions, and mutations. Such diverse
requirements can be viewed as constraint relaxation or
enforcement.
Algorithms for Sequential Pattern Mining
● GSP algorithm
● Sequential Pattern Discovery using Equivalence classes
(SPADE)
● FreeSpan
● PrefixSpan
● MAPres
● Seq2Pat (for constraint-based sequential pattern mining)
Generalized Sequential Pattern algorithm (GSP)
The algorithms for solving sequence mining problems are
mostly based on the Apriori (level-wise) algorithm. One way
to use the level-wise paradigm is to first discover all the
frequent items in a level-wise fashion.

This simply means counting the occurrences of all singleton
elements in the database. Then, the transactions are
filtered by removing the non-frequent items. At the end of
this step, each transaction consists of only the frequent
elements it originally contained.

The GSP algorithm makes multiple database passes. In the
first pass, all single items (1-sequences) are counted. From
the frequent items, a set of candidate 2-sequences is
formed, and another pass is made to identify their
frequency.

The frequent 2-sequences are used to generate the candidate
3-sequences, and this process is repeated until no more
frequent sequences are found.
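The first pass and the filtering step can be sketched as follows. This is a simplified illustration, not the full algorithm: it assumes every event holds a single item, so a sequence is a plain list of items, and it counts an item once per sequence. The function names and the data are hypothetical.

```python
from collections import Counter


def frequent_items(database, min_sup):
    """First pass: count, for each item, how many sequences contain it."""
    counts = Counter()
    for seq in database:
        counts.update(set(seq))  # count each item at most once per sequence
    return {item for item, n in counts.items() if n >= min_sup}


def filter_database(database, keep):
    """Drop non-frequent items, preserving the order of the remaining ones."""
    return [[item for item in seq if item in keep] for seq in database]


# Made-up sequences of single-item events.
db = [
    ["a", "b", "c"],
    ["a", "c", "d"],
    ["a", "b", "c", "e"],
    ["b", "f"],
]

items = frequent_items(db, min_sup=2)
filtered = filter_database(db, items)
```

After filtering, later passes only ever see items that can still be part of a frequent sequence, which shrinks the data scanned on each pass.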
Two Main Steps of GSP
Candidate Generation. Given the set of frequent
(k-1)-sequences F(k-1), the candidates for the next pass are
generated by joining F(k-1) with itself. A pruning phase
eliminates any sequence that has at least one subsequence
that is not frequent.

Support Counting. Normally, a hash-tree-based search is
employed for efficient support counting. Finally,
non-maximal frequent sequences are removed.
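The join and prune steps can be sketched for the simplified case where every event holds a single item, so a sequence is a tuple of items. Under that assumption, two frequent (k-1)-sequences join when the first without its first item equals the second without its last item. The function name and the sample F(2) are hypothetical, and the hash tree used for support counting is not shown.

```python
def generate_candidates(frequent_prev):
    """Join F(k-1) with itself, then prune by the Apriori property."""
    frequent_prev = set(frequent_prev)

    # Join: s1 minus its first item must equal s2 minus its last item.
    joined = {
        s1 + s2[-1:]
        for s1 in frequent_prev
        for s2 in frequent_prev
        if s1[1:] == s2[:-1]
    }

    # Prune: every (k-1)-subsequence obtained by dropping one item
    # must itself be frequent.
    def all_subsequences_frequent(candidate):
        return all(
            candidate[:i] + candidate[i + 1:] in frequent_prev
            for i in range(len(candidate))
        )

    return {c for c in joined if all_subsequences_frequent(c)}


# Made-up set of frequent 2-sequences.
f2 = {("a", "b"), ("b", "c"), ("a", "c"), ("b", "d")}
c3 = generate_candidates(f2)
```

The join produces ("a", "b", "c") and ("a", "b", "d"). The second is pruned because its subsequence ("a", "d") is not in F(2), so it cannot be frequent and never needs to be counted.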
Pseudocode for GSP
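One possible rendering of the level-wise loop described in the GSP slides is given here as runnable Python rather than pseudocode. It is a sketch under simplifying assumptions, not the published algorithm: every event holds a single item, support is counted by a plain scan instead of a hash tree, time constraints are ignored, and non-maximal sequences are kept. All names are hypothetical.

```python
def is_subsequence(pattern, seq):
    """True if the items of pattern appear in seq in the same order."""
    remaining = iter(seq)
    # `item in remaining` advances the iterator past the match.
    return all(item in remaining for item in pattern)


def gsp(database, min_sup):
    """Return {pattern: support} for all frequent sequences of single items."""

    def support(pattern):
        return sum(1 for seq in database if is_subsequence(pattern, seq))

    def keep_frequent(candidates):
        counted = {c: support(c) for c in candidates}  # one database pass
        return {c: n for c, n in counted.items() if n >= min_sup}

    # Pass 1: frequent 1-sequences.
    items = {item for seq in database for item in seq}
    level = keep_frequent((item,) for item in items)

    result = {}
    while level:
        result.update(level)
        prev = set(level)
        # Candidate generation: join F(k-1) with itself ...
        joined = {s1 + s2[-1:] for s1 in prev for s2 in prev if s1[1:] == s2[:-1]}
        # ... and prune candidates with a non-frequent (k-1)-subsequence.
        pruned = [
            c for c in joined
            if all(c[:i] + c[i + 1:] in prev for i in range(len(c)))
        ]
        # Next pass: count the surviving candidates.
        level = keep_frequent(pruned)
    return result


# Made-up database of single-item-event sequences.
db = [["a", "b", "c"], ["a", "c"], ["a", "b", "c"], ["b"]]
patterns = gsp(db, min_sup=2)
```

On this data the loop stops after the third pass, because no candidate 4-sequence can be formed from the single frequent 3-sequence ("a", "b", "c").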
The End
