Sequential pattern mining is a type of data mining concerned with finding statistically relevant patterns between data examples delivered in a sequence. It involves mining frequently occurring subsequences as patterns from transaction databases where items are ordered. The main algorithms for sequential pattern mining include GSP, which makes multiple database passes to mine frequent sequences of increasing length, SPADE, FreeSpan, PrefixSpan, and others. Sequential pattern mining has applications in market basket analysis, DNA sequence analysis, and other domains involving sequential data.
Sequential Pattern Mining
What is sequential pattern mining?
Sequential pattern mining is a topic of data mining concerned with finding statistically relevant patterns between data examples where the values are delivered in a sequence. It is the mining of frequently occurring series of events, or subsequences, as patterns.

Several key computational problems are addressed within this field. These include building efficient databases and indexes for sequence information, extracting frequently occurring patterns, comparing sequences for similarity, and recovering missing sequence members.

In general, sequence mining problems can be classified as string mining, which is typically based on string processing algorithms, and itemset mining, which is typically based on association rule learning.

String Mining
String mining typically deals with a limited alphabet for items that appear in a sequence, but the sequence itself may be very long.
An example of such an alphabet is the ASCII character set used in natural language text.

Algorithms used for String Mining
Repeat-related problems deal with operations on single sequences. They can be based on exact or approximate string matching methods for finding dispersed fixed-length and maximal-length repeats, finding tandem repeats, and finding unique subsequences and missing (un-spelled) subsequences.

Alignment problems deal with comparison between strings by first aligning one or more sequences. Popular methods include BLAST, for comparing a single sequence with multiple sequences in a database, and ClustalW, for multiple alignments. Alignment algorithms can be based on either exact or approximate methods, and can also be classified as global, semi-global, or local alignments.

Itemset mining
Itemset mining is used in marketing applications for discovering regularities between frequently co-occurring items in large transactions.
For example, by analysing transactions of customer shopping baskets in a supermarket, one can produce a rule which reads "if a customer buys onions and potatoes together, he or she is likely to also buy hamburger meat in the same transaction".

Applications of Sequential Pattern Mining
With a great variety of products and user buying behaviors, the shelf on which products are displayed is one of the most important resources in a retail environment. Retailers can both increase their profit and decrease cost through proper management of shelf space allocation and product display. To address this problem, George and Binu (2013) proposed an approach that mines user buying patterns using the PrefixSpan algorithm and places the products on shelves based on the order of the mined purchasing patterns.

Concept behind Sequential Pattern Mining
Given a set of sequences, where each sequence consists of a list of events (or elements) and each event consists of a set of items, and given a user-specified minimum support threshold min_sup, sequential pattern mining discovers all frequent subsequences, i.e., the subsequences whose occurrence frequency in the set of sequences is no less than min_sup.

Let I = {I1, I2, ..., Ip} be the set of all items. An itemset is a nonempty set of items. A sequence is an ordered list of events. A sequence s is denoted <e1 e2 e3 … el>, where event e1 occurs before e2, which occurs before e3, and so on. Event ej is also called an element of s.

In the case of customer purchase data, an event is a shopping trip in which a customer purchases items at a specific store. The event is an itemset, i.e., an unordered list of items that the customer purchased during the trip. The itemset (or event) is denoted (x1x2···xq), where xk is an item.

An item can occur at most once in an event of a sequence, but can occur several times in different events of a sequence.
The number of instances of items in a sequence is called the length of the sequence. A sequence with length l is called an l-sequence.

A sequence database, S, is a set of tuples (SID, s), where SID is a sequence_ID and s is a sequence. For instance, S contains the sequences for all customers of the store. A tuple (SID, s) is said to contain a sequence α if α is a subsequence of s. The support of α in S is the number of tuples that contain α, and α is frequent if its support is at least min_sup.

This model of sequential pattern mining is an abstraction of customer-shopping sequence analysis. Several scalable techniques exist for sequential pattern mining on such data.
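The definitions of sequence length and containment can be made concrete with a minimal Python sketch. It reads the support of a pattern as the number of sequences in the database that contain it. The helper names (make_sequence, sequence_length, contains, support) and the four-sequence database are illustrative assumptions, not part of any particular library, and each character of a string stands for one item purely for brevity.

```python
from typing import FrozenSet, List, Tuple

Event = FrozenSet[str]             # an itemset, e.g. the items of one shopping trip
Seq = Tuple[Event, ...]            # an ordered list of events
Database = List[Tuple[int, Seq]]   # tuples (SID, s)


def make_sequence(*events: str) -> Seq:
    """Build a sequence from strings; each character is one item (illustrative shortcut)."""
    return tuple(frozenset(e) for e in events)


def sequence_length(s: Seq) -> int:
    """Length of a sequence: the total number of item instances over all events."""
    return sum(len(event) for event in s)


def contains(s: Seq, alpha: Seq) -> bool:
    """True if alpha is a subsequence of s.

    Each event of alpha must be a subset of a distinct event of s,
    and the matched events must appear in the same order.
    """
    pos = 0
    for a in alpha:
        # Greedily advance to the earliest event of s that includes a.
        while pos < len(s) and not a <= s[pos]:
            pos += 1
        if pos == len(s):
            return False
        pos += 1
    return True


def support(db: Database, alpha: Seq) -> int:
    """Number of tuples (SID, s) in the database whose sequence contains alpha."""
    return sum(1 for _sid, s in db if contains(s, alpha))


# Hypothetical sequence database with single-letter items.
example_db: Database = [
    (1, make_sequence("a", "abc", "ac", "d", "cf")),
    (2, make_sequence("ad", "c", "bc", "ae")),
    (3, make_sequence("ef", "ab", "df", "c", "b")),
    (4, make_sequence("e", "g", "af", "c", "b", "c")),
]

pattern = make_sequence("ab", "c")         # the 3-sequence <(ab) c>
print(sequence_length(example_db[0][1]))   # 9, so sequence 1 is a 9-sequence
print(support(example_db, pattern))        # 2 (contained in sequences 1 and 3)
```

With min_sup = 2, the pattern <(ab) c> would count as frequent in this toy database.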
In DNA sequence analysis, approximate patterns become useful because DNA sequences can contain (symbol) insertions, deletions, and mutations. Such diverse requirements can be viewed as constraint relaxation or enforcement.

Algorithms for Sequential Pattern Mining
● GSP algorithm
● Sequential Pattern Discovery using Equivalence classes (SPADE)
● FreeSpan
● PrefixSpan
● MAPres
● Seq2Pat (for constraint-based sequential pattern mining)

Generalized Sequential Pattern algorithm (GSP)
The algorithms for solving sequence mining problems are mostly based on the apriori (level-wise) algorithm. One way to use the level-wise paradigm is to first discover all the frequent items in a level-wise fashion.
This simply means counting the occurrences of all singleton elements in the database. Then, the transactions are filtered by removing the non-frequent items. At the end of this step, each transaction consists of only the frequent elements it originally contained.

The GSP algorithm makes multiple database passes. In the first pass, all single items (1-sequences) are counted. From the frequent items, a set of candidate 2-sequences is formed, and another pass is made to determine their frequency.
The frequent 2-sequences are used to generate the candidate 3-sequences, and this process is repeated until no more frequent sequences are found.

Two Main Steps of GSP
Candidate Generation. Given the set of frequent (k-1)-sequences F(k-1), the candidates for the next pass are generated by joining F(k-1) with itself. A pruning phase eliminates any sequence that has at least one infrequent subsequence.
Support Counting. Normally, a hash tree-based search is employed for efficient support counting. Finally, non-maximal frequent sequences are removed.

Pseudocode for GSP
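The level-wise loop and its two steps (candidate generation with pruning, then support counting) can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not a reproduction of any published GSP pseudocode: every event is assumed to hold exactly one item, so a sequence is just a tuple of items; support is counted with a plain linear scan instead of a hash tree; and the function names (gsp, generate_candidates, count_support, maximal_only) are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set, Tuple

Pattern = Tuple[str, ...]  # assumption: one item per event


def is_subsequence(pattern: Pattern, seq: Pattern) -> bool:
    """True if the items of pattern appear in seq in the same order."""
    it = iter(seq)
    return all(item in it for item in pattern)  # 'in' consumes the iterator


def count_support(candidates: Iterable[Pattern], db: List[Pattern]) -> Dict[Pattern, int]:
    """Support counting by linear scan (a hash tree would be used for efficiency)."""
    return {c: sum(is_subsequence(c, s) for s in db) for c in candidates}


def generate_candidates(frequent: Set[Pattern]) -> Set[Pattern]:
    """Join F(k-1) with itself, then prune candidates with an infrequent subsequence."""
    # Join: p and q combine when p without its first item equals q without its last.
    joined = {p + (q[-1],) for p in frequent for q in frequent if p[1:] == q[:-1]}
    # Prune: every (k-1)-subsequence obtained by dropping one item must be frequent.
    return {
        c for c in joined
        if all(c[:i] + c[i + 1:] in frequent for i in range(len(c)))
    }


def gsp(db: List[Pattern], min_sup: int) -> Dict[Pattern, int]:
    """Return every frequent sequence together with its support."""
    # Pass 1: count each item once per sequence and keep the frequent 1-sequences.
    item_counts = Counter(item for seq in db for item in set(seq))
    frequent = {(item,): n for item, n in item_counts.items() if n >= min_sup}
    result = dict(frequent)
    # Later passes: build candidates from the previous level, then count them.
    while frequent:
        counts = count_support(generate_candidates(set(frequent)), db)
        frequent = {c: n for c, n in counts.items() if n >= min_sup}
        result.update(frequent)
    return result


def maximal_only(patterns: Dict[Pattern, int]) -> Dict[Pattern, int]:
    """Drop frequent sequences that are contained in a longer frequent sequence."""
    return {
        p: n for p, n in patterns.items()
        if not any(p != q and is_subsequence(p, q) for q in patterns)
    }


# Hypothetical toy database: four sequences of single-item events.
toy_db: List[Pattern] = [
    ("a", "b", "c", "d"),
    ("a", "c", "d"),
    ("a", "b", "d"),
    ("b", "c"),
]

all_frequent = gsp(toy_db, min_sup=2)
print(len(all_frequent))           # 12 frequent sequences
print(maximal_only(all_frequent))  # only (a, b, d), (a, c, d) and (b, c) remain
```

On this toy input the loop stops after the third pass, because the two frequent 3-sequences cannot be joined into any 4-sequence candidate. Handling events with several items would require the fuller join rule of GSP, which this sketch deliberately leaves out.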