
DM Lect 5 - Sequence & Stream Mining

The document discusses various data mining techniques, focusing on frequent pattern mining, sequence mining, and stream data processing. It outlines methods such as Apriori, GSP, and PrefixSpan for mining sequential patterns and addresses challenges in processing data streams, including real-time processing and memory limits. Additionally, it introduces strategies like sampling, window processing, and the Lossy Counting Algorithm for managing frequent itemsets in dynamic data environments.

Frequent Patterns Mining
Sequence & Stream Mining

Dr. Wedad Hussein


[email protected]

Dr. Mahmoud Mounir


[email protected]
Data Mining Techniques

Data Mining
• Frequent Patterns Mining
  • Association Rules: Apriori, FP-Growth, ECLAT
  • Sequence Mining: GSP, SPADE, PrefixSpan
• Clustering
• Classification
Element vs Event

Sequence Database | Sequence | Element | Event (Item)
Web Data | Browsing activity of a particular Web visitor | A collection of files viewed by a Web visitor after a single mouse click | Home page, index page, contact info, etc.
E-Commerce Data | Whole purchase history of a customer | All products bought in a single transaction | Individual products

Example: in the sequence <{i1, i2, i4}, {i3}, {i2, i4}>, each set such as {i1, i2, i4} is an element, and each item such as i1 is an event.


Subsequence Examples

Sequence A | Sequence B | Is B a subsequence of A?
<{2,4} {3,5,6} {8}> | <{2} {3,6} {8}> | Yes
<{2,4} {3,5,6} {8}> | <{2} {8}> | Yes
<{1,2} {3,4}> | <{1} {2}> | No
<{2,4} {2,4} {2,5}> | <{2} {4}> | Yes
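The subsequence test behind the table can be sketched in Python (an illustrative helper, not part of the lecture); each sequence is a list of sets of events:

```python
def is_subsequence(seq_a, seq_b):
    """Return True if seq_b is a subsequence of seq_a.

    B is a subsequence of A if each element of B is a subset of some
    element of A, and the matched elements appear in order.
    """
    i = 0  # current position in seq_a
    for element_b in seq_b:
        # advance through A until an element contains element_b
        while i < len(seq_a) and not element_b.issubset(seq_a[i]):
            i += 1
        if i == len(seq_a):
            return False
        i += 1  # the next element of B must match strictly later
    return True
```

Running it on the table's rows reproduces the Yes/Yes/No/Yes column.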
Sequence Mining Techniques
• Apriori-based method: GSP (Generalized Sequential Patterns; Srikant & Agrawal)
• Pattern-growth methods: FreeSpan & PrefixSpan (Han et al.; Pei et al.)
• Vertical format-based mining: SPADE (M. J. Zaki)
PrefixSpan
Prefix and Suffix (Projection)
• <{a}>, <{a}{a}>, <{a}{ab}> and <{a}{abc}> are prefixes of sequence <{a}{abc}{ac}{d}{cf}>
• Given sequence <{a}{abc}{ac}{d}{cf}>:

Prefix | Suffix (Prefix-Based Projection)
<{a}> | <{abc}{ac}{d}{cf}>
<{a}{a}> | <{_bc}{ac}{d}{cf}>
<{a}{ab}> | <{_c}{ac}{d}{cf}>
Prefix Projection

• Step 1: find length-1 sequential patterns: <a>, <b>, <c>, <d>, <e>, <f>
• Step 2: divide the search space. The complete set of sequential patterns can be partitioned into 6 subsets:
  • the ones having prefix <a>;
  • the ones having prefix <b>;
  • …
  • the ones having prefix <f>.

SID | sequence
10 | <{a}{abc}{ac}{d}{cf}>
20 | <{ad}{c}{bc}{ae}>
30 | <{ef}{ab}{df}{c}{b}>
40 | <{e}{g}{af}{c}{b}{c}>
PrefixSpan - Example
• 1. Find length-1 sequential patterns:

id | Sequence
10 | <{a}{abc}{ac}{d}{cf}>
20 | <{ad}{c}{bc}{ae}>
30 | <{ef}{ab}{df}{c}{b}>
40 | <{e}{g}{af}{c}{b}{c}>

Pattern | <a> | <b> | <c> | <d> | <e> | <f> | <g>
Support | 4 | 4 | 4 | 3 | 3 | 3 | 1

Frequent events: <a>, <b>, <c>, <d>, <e>, <f>
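The length-1 support counts can be reproduced with a short sketch (Python; the database literal mirrors the slide's table):

```python
from collections import Counter

# the example sequence database from the slides
db = [
    [{'a'}, {'a', 'b', 'c'}, {'a', 'c'}, {'d'}, {'c', 'f'}],
    [{'a', 'd'}, {'c'}, {'b', 'c'}, {'a', 'e'}],
    [{'e', 'f'}, {'a', 'b'}, {'d', 'f'}, {'c'}, {'b'}],
    [{'e'}, {'g'}, {'a', 'f'}, {'c'}, {'b'}, {'c'}],
]

# support of a length-1 pattern = number of sequences containing the event
support = Counter()
for seq in db:
    for event in set().union(*seq):
        support[event] += 1
```

This yields the same counts as the slide: a, b, c appear in all 4 sequences; d, e, f in 3; g in only 1, so g is pruned.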
PrefixSpan - Example
• 2. Divide the search space by projecting the database on each frequent prefix:

Prefix <a>: <{abc}{ac}{d}{cf}>, <{_d}{c}{bc}{ae}>, <{_b}{df}{c}{b}>, <{_f}{c}{b}{c}>
Prefix <b>: <{_c}{ac}{d}{cf}>, <{_c}{ae}>, <{df}{c}{b}>, <{c}>
Prefix <c>: <{ac}{d}{cf}>, <{bc}{ae}>, <{b}>, <{b}{c}>
Prefix <d>: <{cf}>, <{c}{bc}{ae}>, <{_f}{c}{b}>
Prefix <e>: <{_f}{ab}{df}{c}{b}>, <{af}{c}{b}{c}>
Prefix <f>: <{ab}{df}{c}{b}>, <{c}{b}{c}>
PrefixSpan – Example

<d>-projected database:
<{cf}>
<{c}{bc}{ae}>
<{_f}{c}{b}>

Item | <a> | <b> | <c> | <d> | <e> | <f> | <{_f}>
Support | 1 | 2 | 3 | 0 | 1 | 1 | 1

Frequent sequences: <{d}{b}>, <{d}{c}>
PrefixSpan – Example
• Continue with the frequent sequences <{d}{b}> and <{d}{c}>:

<d>-projected database:
<{cf}>
<{c}{bc}{ae}>
<{_f}{c}{b}>

Use b as a prefix (<{d}{b}>-projected): <{_c}{ae}> — no frequent events.
Use c as a prefix (<{d}{c}>-projected): <{bc}{ae}>, <{b}>

Item | <b> | <a> | <e> | <c>
Support | 2 | 1 | 1 | 1

Frequent: <{d}{c}{b}>; its projected database <{_c}{ae}> contains no frequent events, so this branch stops.
PrefixSpan – Example
• Find combinations with c:

<c>-projected database:
<{ac}{d}{cf}>
<{bc}{ae}>
<{b}>
<{b}{c}>

Item | <a> | <b> | <c> | <d> | <e> | <f>
Support | 2 | 3 | 3 | 1 | 1 | 1

Frequent: <{c}{a}>, <{c}{b}>, <{c}{c}>

<{c}{a}>-projected database: <{_c}{d}{cf}>, <{_e}> — no frequent events.
<{c}{b}>-projected database: <{_c}{ae}>, <{c}> — no frequent events.
<{c}{c}>-projected database: <{_f}> — no frequent events.
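The whole example can be tied together with a compact PrefixSpan sketch. This is a deliberate simplification: patterns are grown one single-event element at a time, so itemset extensions such as <{a}{ab}> are not generated, but it reproduces the sequence-extension patterns traced above:

```python
from collections import Counter

def prefixspan(db, min_sup):
    """Simplified PrefixSpan sketch (sequence extensions only)."""
    results = []

    def project(db, event):
        # prefix-based projection: keep the suffix after the first
        # element that contains the event
        projected = []
        for seq in db:
            for i, element in enumerate(seq):
                if event in element:
                    if seq[i + 1:]:
                        projected.append(seq[i + 1:])
                    break
        return projected

    def mine(prefix, db):
        support = Counter()
        for seq in db:
            for event in set().union(*seq):
                support[event] += 1
        for event in sorted(support):
            if support[event] >= min_sup:
                pattern = prefix + [event]
                results.append((pattern, support[event]))
                mine(pattern, project(db, event))  # recurse on projection

    mine([], db)
    return results

# the slide's sequence database
db = [
    [{'a'}, {'a', 'b', 'c'}, {'a', 'c'}, {'d'}, {'c', 'f'}],
    [{'a', 'd'}, {'c'}, {'b', 'c'}, {'a', 'e'}],
    [{'e', 'f'}, {'a', 'b'}, {'d', 'f'}, {'c'}, {'b'}],
    [{'e'}, {'g'}, {'a', 'f'}, {'c'}, {'b'}, {'c'}],
]
patterns = prefixspan(db, min_sup=2)
```

With min_sup = 2 this finds, among others, <{d}{c}> with support 3 and <{d}{c}{b}> with support 2, matching the hand trace above.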
Data Streams
Stream Data

• Data arrives fast.
• If it is not processed immediately, it is lost.
• Examples:
  • Sensor data
  • Image data
  • Web traffic
  • Social media
Issues with Stream Data

• Processing is done in real time.
• Multiple passes are not possible.
• Memory limits.
• Accuracy vs. storage trade-off.
Processing Data Streams
A. Sampling

• Keep an unbiased sample of the data.
• Use reservoir sampling:
  • Keeps a sample of size s.
  • Every new element in the stream has a probability (s/n for the n-th element) of replacing a randomly chosen old element.
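Reservoir sampling as described above can be sketched as follows (the function name is illustrative):

```python
import random

def reservoir_sample(stream, s):
    """Reservoir sampling: keep a uniform random sample of size s
    from a stream of unknown length."""
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= s:
            reservoir.append(item)   # fill the reservoir first
        else:
            j = random.randrange(n)  # uniform in [0, n-1]
            if j < s:                # the n-th element replaces a
                reservoir[j] = item  # random old one with prob. s/n
    return reservoir
```

A single pass suffices, and at every point the reservoir is a uniform sample of everything seen so far.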
B. Window Processing

• A window is a subset of the arrived transactions.
• Types of windows:
  • Landmark Window
  • Sliding Window
  • Damped Window
Landmark Window

• From a fixed starting point i to the current time t.
• If i = 1, we are processing the whole stream.
• All time points are equally important.
Sliding Window

• Focus on recent data only.
• Given a window size w and current time t, we process W[t-w+1, t].
• The window moves with the data.
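A sliding window of size w can be sketched with a bounded deque (the class name is illustrative):

```python
from collections import deque

class SlidingWindow:
    """Keep only the w most recent stream elements."""

    def __init__(self, w):
        self.window = deque(maxlen=w)  # oldest element is evicted automatically

    def add(self, item):
        self.window.append(item)

    def contents(self):
        # the current window W[t-w+1, t], oldest first
        return list(self.window)
```

Each arrival pushes out the oldest element once the window is full, so memory stays bounded by w regardless of stream length.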
Damped Window

• Associates weights with the data in the stream, giving higher weights to recent data than to older data.
C. Histograms
• Approximate the distribution of values.
• Divide the range into buckets of equal width or equal depth.
• Can be used later to approximate query answers.
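The equal-width variant can be sketched as follows (function name illustrative; equal-depth buckets would instead place roughly the same number of values in each bucket):

```python
def equal_width_histogram(values, k):
    """Count values falling into k equal-width buckets over [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    counts = [0] * k
    for v in values:
        if width == 0:
            i = 0  # all values identical: everything in one bucket
        else:
            i = min(int((v - lo) / width), k - 1)  # clamp max into last bucket
        counts[i] += 1
    return counts
```

The bucket counts can then answer range queries approximately without storing the raw stream.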
D. Stream Queries

• One-time query: evaluated once at a certain point in time.
• Continuous query: evaluated continuously as data streams continue to arrive.
Frequent Pattern Mining in Data Streams
Challenges

• Multiple passes are not achievable.
• The frequency of items changes over time (frequent items can become infrequent and vice versa).
• The number of infrequent itemsets is exponential.
Approaches

• Keep track of a limited set of itemsets.
  • Disadvantage: limited usage and expressiveness.
• Derive an approximate set of answers.
  • Approximate the frequency within a certain error limit.
Lossy Counting Algorithm

• Inputs: min_support threshold (σ) and error bound (ε).
• Divide the stream into buckets of width w = ⌈1/ε⌉.
Finding Frequent Items

When a new item arrives in the current bucket b:
• If the item already exists in the list, increase its frequency f by 1.
• Otherwise, add the item to the list with f = 1 and Δ = b - 1.
At the end of each bucket, delete all items with f + Δ ≤ b.

Δ is the maximum possible error on the frequency count.
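The steps above can be sketched as a small implementation (names are illustrative; the reporting rule f ≥ (σ - ε)·N follows the standard Lossy Counting guarantee):

```python
from math import ceil

def lossy_counting(stream, epsilon):
    """Lossy Counting for single items.

    counts maps item -> (f, delta), where f is the counted frequency
    and delta is the maximum possible undercount for that item.
    """
    width = ceil(1 / epsilon)              # bucket width = ceil(1/eps)
    counts = {}
    n = 0                                  # items seen so far
    for item in stream:
        n += 1
        b = ceil(n / width)                # current bucket id
        if item in counts:
            f, delta = counts[item]
            counts[item] = (f + 1, delta)  # known item: bump frequency
        else:
            counts[item] = (1, b - 1)      # new item: f = 1, delta = b - 1
        if n % width == 0:                 # end of bucket: prune
            for key in [k for k, (f, d) in counts.items() if f + d <= b]:
                del counts[key]
    return counts, n

def frequent_items(counts, n, sigma, epsilon):
    """Report items whose counted frequency is at least (sigma - epsilon) * n."""
    return {k: f for k, (f, d) in counts.items() if f >= (sigma - epsilon) * n}
```

Because Δ ≤ εN, every true frequent item survives the pruning, which is why the reporting threshold is lowered from σN to (σ - ε)N.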
Support Estimation

• The support can be underestimated, but it can be proven that no frequent items are lost.
• Properties of the algorithm:
  • There are no false negatives.
  • False positives are quite "positive": their true frequency is at least (σ - ε)N.
  • The error in estimation is never very high (at most εN).
Finding Frequent Itemsets

• Read β buckets into memory.
• Generate the subsets of each transaction and count their frequency f within the β buckets.
• For each subset (itemset):
  • If the itemset already exists in the list, update its frequency and delete it if f + Δ ≤ b.
  • If it does not exist and f ≥ β, insert it into the list with Δ = b - β.
• Repeat until all subsets are processed.
Thank You
