Mining Frequent Patterns
Asma Kanwal
Lecturer
What Is Frequent Pattern Analysis?
Frequent pattern: a pattern (a set of items, a subsequence, a substructure, etc.) that occurs frequently in a data set.
Applications
Basket data analysis, cross-marketing, catalog design, sales campaign analysis, Web log (click-stream) analysis, and DNA sequence analysis.
Why Is Freq. Pattern Mining Important?
Market-Basket transactions:

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of Association Rules:
{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}
Transaction data can be broadly interpreted:
A set of documents…
• A text document data set. Each document is treated as a “bag” of
keywords. Note, text is ordered, but bags of words are not ordered.
[Figure: documents as transactions — each document is a set of keywords, e.g. Doc 1 = {A, B}, Doc 2 = {A, C}, Doc 3 = {D, C}, Doc 4 = {A}. Example of Association Rules: {A} → {B}]
Use of Association Rules
Association rules do not represent any sort of causality or
correlation between the two itemsets:
X → Y does not mean X causes Y, so no causality.
X → Y can be different from Y → X, unlike correlation, which is symmetric.
• Important Note
– Association rules do not consider order. So…
– {Milk, Diaper} → {Beer}
and
– {Diaper, Milk} → {Beer}
…are the same rule.
Brute-force approach:
List all possible association rules
Compute the support and confidence for each rule
Prune rules that fail the minsup and minconf thresholds
Computationally prohibitive!
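The blow-up is concrete: with d items there are 3^d − 2^(d+1) + 1 possible rules (602 for the 6 items above). A minimal Python sketch of the brute-force approach over the market-basket data from the slides (the minsup/minconf values here are illustrative):

```python
from itertools import combinations

# Market-basket data from the slides
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
items = sorted(set().union(*transactions))

def support(itemset):
    """Fraction of transactions containing every item of itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

minsup, minconf = 0.4, 0.6  # illustrative thresholds

rules = []
# Brute force: every itemset, then every binary partition of it into
# antecedent (lhs) and consequent; support and confidence for each.
for k in range(2, len(items) + 1):
    for cand in combinations(items, k):
        itemset = frozenset(cand)
        s = support(itemset)
        if s < minsup:
            continue  # fails the minsup threshold
        for r in range(1, k):
            for lhs in combinations(cand, r):
                lhs = frozenset(lhs)
                c = s / support(lhs)  # safe: support(lhs) >= s >= minsup
                if c >= minconf:
                    rules.append((lhs, itemset - lhs, s, c))

print(len(rules), "rules pass both thresholds")
```

Even for 6 items this evaluates hundreds of candidate rules and counts support for each — exactly the cost the two-step approach below is designed to avoid.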
Mining Association Rules

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of Rules:
{Milk, Diaper} → {Beer} (s=0.4, c=0.67)
{Milk, Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk, Beer} (s=0.4, c=0.5)
{Milk} → {Diaper, Beer} (s=0.4, c=0.5)
Observations:
• All the above rules are binary partitions of the same itemset:
{Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but
can have different confidence
• Thus, we can decouple the support and confidence requirements
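The observation can be checked directly. A small Python sketch (the helper is mine, not from the slides) that enumerates the six binary partitions of {Milk, Diaper, Beer} over the transactions above:

```python
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

itemset = frozenset({"Milk", "Diaper", "Beer"})
s = support(itemset)  # identical for every rule from this itemset
for r in range(1, len(itemset)):
    for lhs in combinations(sorted(itemset), r):
        lhs = frozenset(lhs)
        c = s / support(lhs)  # confidence depends on the antecedent
        print(sorted(lhs), "->", sorted(itemset - lhs), f"s={s:.1f}, c={c:.2f}")
```

All six rules share the itemset's support (0.4); only confidence varies with the antecedent — which is why the support check can be done once per itemset and rule generation handled separately.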
Mining Association Rules
Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup
2. Rule Generation
– Generate high confidence rules from each frequent itemset,
where each rule is a binary partitioning of a frequent itemset
Pseudo-code:
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
  Ck+1 = candidates generated from Lk;
  for each transaction t in database do
    increment the count of all candidates in Ck+1 that are contained in t;
  Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;

Example of candidate generation, with L3 = {abc, abd, acd, ace, bcd}:
Self-joining: L3*L3
  abcd from abc and abd
  acde from acd and ace
Pruning:
  acde is removed because ade is not in L3
C4 = {abcd}
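The self-join and prune steps can be sketched as follows (a toy implementation assuming itemsets are kept as sorted tuples; `apriori_gen` is my name for it, not a library call):

```python
from itertools import combinations

def apriori_gen(Lk):
    """C(k+1) from Lk: self-join pairs agreeing on the first k-1 items,
    then prune any candidate with an infrequent k-subset."""
    Lk = set(Lk)
    k = len(next(iter(Lk)))
    out = set()
    for a in Lk:
        for b in Lk:
            if a[:-1] == b[:-1] and a[-1] < b[-1]:  # self-join step
                cand = a + (b[-1],)
                # Prune step: every k-subset of cand must itself be frequent
                if all(s in Lk for s in combinations(cand, k)):
                    out.add(cand)
    return out

# L3 as inferred from the example above
L3 = {("a","b","c"), ("a","b","d"), ("a","c","d"), ("a","c","e"), ("b","c","d")}
print(apriori_gen(L3))  # abcd survives; acde is pruned since ade is not in L3
```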
How to Generate Candidates?
Step 1: self-joining Lk-1
Step 2: pruning
forall itemsets c in Ck do
  forall (k-1)-subsets s of c do
    if (s is not in Lk-1) then delete c from Ck
Subset function: counting supported candidates with a hash tree
Hash function: h(item) = item mod 3 (branches: 1,4,7 | 2,5,8 | 3,6,9)
Transaction: 1 2 3 5 6
Candidate 3-itemsets stored in the tree: 124, 125, 136, 145, 159, 234, 345, 356, 357, 367, 368, 457, 458, 567, 689
[Figure: the transaction is recursively split (1+2356, 12+356, 13+56, …) and hashed down the tree, so only matching leaf buckets are checked]
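The operation the hash tree accelerates is deciding which candidates a transaction contains. A naive Python version of that containment test, using the 15 candidate 3-itemsets from the figure:

```python
# The 15 candidate 3-itemsets from the hash-tree figure
candidates = [frozenset(map(int, c)) for c in
              ["124", "125", "136", "145", "159", "234", "345", "356",
               "357", "367", "368", "457", "458", "567", "689"]]

transaction = {1, 2, 3, 5, 6}

# Naive subset check: test every candidate against the transaction.
# The hash tree avoids this by hashing items (h(i) = i mod 3) so the
# transaction only descends into branches it can possibly match.
contained = [c for c in candidates if c <= transaction]
print(sorted(sorted(c) for c in contained))
```

Only 3 of the 15 candidates are contained in the transaction; the hash tree reaches them without comparing against every candidate.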
Challenges of Frequent Pattern Mining
Challenges
Multiple scans of transaction database
Huge number of candidates
Tedious workload of support counting for
candidates
Improving Apriori: general ideas
Reduce passes of transaction database scans
Shrink number of candidates
Facilitate support counting of candidates
Partition: Scan Database Only Twice
Any itemset that is potentially frequent in DB must be
frequent in at least one of the partitions of DB
Scan 1: partition database and find local frequent
patterns
Scan 2: consolidate global frequent patterns
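A sketch of the two-scan partition idea in Python (the split into two halves and the naive local miner are illustrative choices, not prescribed by the slides):

```python
from itertools import combinations

# Market-basket data from the slides
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
minsup = 0.4  # relative support, applied per partition and globally

def local_frequent(part, minsup):
    """Naively enumerate all itemsets frequent within one partition."""
    items = sorted(set().union(*part))
    found = set()
    for k in range(1, len(items) + 1):
        for cand in combinations(items, k):
            cand = frozenset(cand)
            if sum(cand <= t for t in part) / len(part) >= minsup:
                found.add(cand)
    return found

# Scan 1: mine each partition independently for local frequent itemsets.
parts = [transactions[:3], transactions[3:]]
global_candidates = set().union(*(local_frequent(p, minsup) for p in parts))

# Scan 2: one full pass to count those candidates over the whole database.
frequent = {c for c in global_candidates
            if sum(c <= t for t in transactions) / len(transactions) >= minsup}
```

Correctness rests on the slide's claim: any globally frequent itemset must be frequent in at least one partition, so scan 1 can never miss a truly frequent itemset — scan 2 only filters false positives.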
Reduce the Number of Candidates
A k-itemset whose corresponding hashing bucket count is
below the threshold cannot be frequent
Candidates: a, b, c, d, e
Hash entries: {ab, ad, ae} {bd, be, de} …
Frequent 1-itemset: a, b, d, e
ab is not a candidate 2-itemset if the sum of count of
{ab, ad, ae} is below support threshold
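A sketch of this hash-based pruning in Python, run on the market-basket data rather than the abstract items a–e: a pair's bucket count is an upper bound on the pair's own count, so a light bucket cannot hold a frequent pair (the hash function, bucket count, and threshold are illustrative):

```python
from itertools import combinations
from collections import Counter

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
min_count = 2   # absolute support threshold (illustrative)
NBUCKETS = 7    # kept small on purpose, so buckets collide

def bucket(pair):
    """Toy deterministic hash of a 2-itemset into a bucket."""
    return sum(map(ord, "".join(pair))) % NBUCKETS

# One scan: count 1-itemsets and, at the same time, hash every pair.
item_counts = Counter()
bucket_counts = Counter()
for t in transactions:
    item_counts.update(t)
    for pair in combinations(sorted(t), 2):
        bucket_counts[bucket(pair)] += 1

frequent_items = {i for i, c in item_counts.items() if c >= min_count}

# A pair is kept as a candidate only if both items are frequent AND its
# bucket reached the threshold; a bucket below it cannot hold a frequent pair.
C2 = [pair for pair in combinations(sorted(frequent_items), 2)
      if bucket_counts[bucket(pair)] >= min_count]
```

Because buckets aggregate several pairs, this prunes conservatively: it may keep an infrequent pair whose bucket is heavy, but it never discards a frequent one.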
Sampling for Frequent Patterns