Week 3
• Basic Concepts
• Methods
• Summary
What Is Frequent Pattern Analysis?
• Frequent pattern: a pattern (a set of items,
subsequences, substructures, etc.) that occurs
frequently in a data set
• First proposed by Agrawal, Imielinski, and Swami
[AIS93] in the context of frequent itemsets and
association rule mining
A simple example from market basket analysis
• Five transactions in a supermarket:
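The slide's own table is not reproduced in this text; as a stand-in, five hypothetical baskets and the support and confidence of one candidate rule:

    # Five hypothetical baskets (illustrative, not the slide's actual table)
    baskets = [
        {'bread', 'milk'},
        {'bread', 'diaper', 'beer', 'eggs'},
        {'milk', 'diaper', 'beer', 'cola'},
        {'bread', 'milk', 'diaper', 'beer'},
        {'bread', 'milk', 'diaper', 'cola'},
    ]
    # Rule: {diaper} => {beer}
    n_diaper = sum(1 for b in baskets if 'diaper' in b)
    n_both = sum(1 for b in baskets if {'diaper', 'beer'} <= b)
    print(f"support = {n_both / len(baskets):.0%}")   # 60%: 3 of 5 baskets
    print(f"confidence = {n_both / n_diaper:.0%}")    # 75%: 3 of 4 diaper baskets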
Mining Frequent Patterns, Associations and Correlations: Basic Concepts and Methods
• Basic Concepts
• Methods
• Summary
Scalable Frequent Itemset Mining Methods
The Downward Closure Property and Scalable Mining Methods
• The downward closure property of frequent patterns
• Any subset of a frequent itemset must be frequent
• If {beer, diaper, nuts} is frequent, so is {beer, diaper}
• i.e., every transaction having {beer, diaper, nuts} also contains
{beer, diaper}
• Scalable mining methods: Three major approaches
• Apriori (Agrawal & Srikant @VLDB’94)
• Frequent pattern growth (FPgrowth: Han, Pei & Yin @SIGMOD’00)
• Vertical data format approach (Charm: Zaki & Hsiao @SDM’02)
Apriori: A Candidate Generation & Test Approach
• Apriori pruning principle: if any itemset is infrequent, its supersets should not be generated or tested
• Method: generate length-(k+1) candidate itemsets from frequent k-itemsets, test the candidates against the DB, and repeat until no more frequent itemsets can be found
The Apriori Algorithm—An Example
min_sup = 2

Database TDB:
  Tid  Items
  10   A, C, D
  20   B, C, E
  30   A, B, C, E
  40   B, E

1st scan, C1 with counts: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
L1 (prune {D}, sup < 2): {A}:2, {B}:3, {C}:3, {E}:3

C2 from L1: {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
2nd scan, C2 with counts: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2

C3: {A,B,C}, {A,C,E}, {B,C,E}
3rd scan, C3 with counts: {A,B,C}:1, {A,C,E}:1, {B,C,E}:2
L3: {B,C,E}:2
The Apriori Algorithm (Pseudo-Code)
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent items};
for (k = 1; Lk ≠ ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
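A minimal runnable sketch of this pseudocode in Python (function names and data layout are illustrative, not from the slides); transactions are sets of items and support is an absolute count:

    from itertools import combinations

    def gen_candidates(Lk_sets, k):
        # Self-join Lk with itself, then prune by the downward closure property
        items = list(Lk_sets)
        out = set()
        for i in range(len(items)):
            for j in range(i + 1, len(items)):
                u = items[i] | items[j]
                if len(u) == k + 1 and all(
                        frozenset(s) in Lk_sets for s in combinations(u, k)):
                    out.add(u)
        return out

    def apriori(transactions, min_sup):
        counts = {}
        for t in transactions:                  # 1st scan: count single items
            for item in t:
                key = frozenset([item])
                counts[key] = counts.get(key, 0) + 1
        Lk = {s: c for s, c in counts.items() if c >= min_sup}
        frequent, k = dict(Lk), 1
        while Lk:
            Ck1 = gen_candidates(set(Lk), k)    # generate C(k+1) from Lk
            counts = {c: 0 for c in Ck1}
            for t in transactions:              # one scan per level
                for c in Ck1:
                    if c <= t:
                        counts[c] += 1
            Lk = {s: c for s, c in counts.items() if c >= min_sup}
            frequent.update(Lk)
            k += 1
        return frequent

    tdb = [{'A','C','D'}, {'B','C','E'}, {'A','B','C','E'}, {'B','E'}]
    print(apriori(tdb, min_sup=2))  # includes frozenset({'B','C','E'}): 2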
Implementation of Apriori
• How to generate candidates Ck+1 from Lk?
• Step 1: self-joining Lk
• Step 2: pruning
• Example of candidate generation
• L3={abc, abd, acd, ace, bcd}
• Self-joining: L3*L3
• abcd from abc and abd
• acde from acd and ace
• abce from abc and ace
• NO abcde !
• Pruning:
• acde is removed because ade is not in L3
• abce is removed because bce is not in L3
• C4 = {abcd}
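Running the gen_candidates sketch from the earlier slide on this L3 reproduces the result (items encoded as single characters, an illustrative convention):

    L3 = {frozenset(s) for s in ('abc', 'abd', 'acd', 'ace', 'bcd')}
    C4 = gen_candidates(L3, 3)
    print(C4)  # {frozenset({'a', 'b', 'c', 'd'})}, i.e. C4 = {abcd}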
Generating Association Rules from Frequent Itemsets
• Strong association rules satisfy both min_support and min_confidence
• Steps
  • For each frequent itemset l, generate all nonempty proper subsets of l.
  • For every nonempty subset s of l, output the rule s ⇒ (l − s) if
    confidence(s ⇒ (l − s)) = support_count(l) / support_count(s) ≥ min_confidence
Example
• X={I1,I2,I5}
• What are the association rules
generated from X?
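A sketch enumerating the six candidate rules from X; each would be kept only if its confidence, computed from the transaction counts (not reproduced in this text), meets min_confidence:

    from itertools import combinations

    X = frozenset({'I1', 'I2', 'I5'})
    for r in range(1, len(X)):
        for s in combinations(sorted(X), r):
            s = frozenset(s)
            print(sorted(s), '=>', sorted(X - s))
    # I1=>I2,I5   I2=>I1,I5   I5=>I1,I2
    # I1,I2=>I5   I1,I5=>I2   I2,I5=>I1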
Further Improvement of the Apriori Method
• Major computational challenges: multiple scans of the transaction database, a huge number of candidates, tedious support counting
• General ideas: reduce the number of database scans, shrink the number of candidates, facilitate support counting of candidates
Partition
• Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
• Scan 1: partition the database and find local frequent patterns (see the sketch below)
• Scan 2: consolidate global frequent patterns
• Hash-based pruning: {a,b} is not a candidate 2-itemset if the count of its hash bucket (e.g., one aggregating {ab, ad, ae}) is below the support threshold
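A minimal sketch of the two-scan idea, reusing the apriori function sketched earlier (the partitioning scheme and names here are illustrative):

    import math

    def partition_mine(transactions, min_sup, n_parts=2):
        # Scan 1: mine each partition with a proportionally scaled local threshold;
        # any globally frequent itemset must be locally frequent in some partition
        size = math.ceil(len(transactions) / n_parts)
        candidates = set()
        for i in range(0, len(transactions), size):
            part = transactions[i:i + size]
            local_min = max(1, math.ceil(min_sup * len(part) / len(transactions)))
            candidates |= set(apriori(part, local_min))
        # Scan 2: count every locally frequent candidate against the full database
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: n for c, n in counts.items() if n >= min_sup}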
Sampling for Frequent Patterns
Dynamic item counting (DIC): Reduce Number of Scans
[Figure: itemset lattice from {} up to ABCD over a stream of transactions. Apriori finishes counting 1-itemsets over a full scan before starting 2-itemsets; DIC starts counting a k-itemset mid-scan as soon as all of its subsets look locally frequent, so 1-, 2-, and 3-itemsets are counted concurrently.]
Scalable Frequent Itemset Mining Methods
Pattern-Growth Approach: Mining Frequent Patterns Without Candidate Generation
FP Tree construction
• Root is represented as null
• Scan the data set one transaction at a time to build the FP-tree. For each transaction:
  • If it shares no common prefix with an existing path, form a new path and set the counter of each node to 1.
  • If it shares a common prefix, increment the counters of the shared prefix nodes and create new nodes for the remainder as needed.
• Continue until every transaction has been mapped onto the tree.
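A minimal two-scan construction sketch in Python (class and function names are illustrative); the first scan builds the f-list, the second inserts each transaction's frequent items in f-list order:

    class FPNode:
        def __init__(self, item, parent):
            self.item, self.parent = item, parent
            self.count = 1
            self.children = {}

    def build_fp_tree(transactions, min_sup):
        # Scan 1: count items; keep frequent ones in frequency-descending order (f-list)
        counts = {}
        for t in transactions:
            for i in t:
                counts[i] = counts.get(i, 0) + 1
        flist = [i for i, c in sorted(counts.items(), key=lambda x: -x[1]) if c >= min_sup]
        rank = {i: r for r, i in enumerate(flist)}
        root = FPNode(None, None)
        header = {i: [] for i in flist}        # header table: item -> node-links
        # Scan 2: insert each transaction's frequent items, most frequent first
        for t in transactions:
            node = root
            for i in sorted((i for i in t if i in rank), key=rank.get):
                if i in node.children:         # shared prefix: bump the counter
                    node.children[i].count += 1
                else:                          # new branch
                    node.children[i] = FPNode(i, node)
                    header[i].append(node.children[i])
                node = node.children[i]
        return root, header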
FP Tree with Header table and node-links
[Figure: FP-tree rooted at {}, with a header table whose node-links chain together all nodes for each item, e.g. reaching the nodes labeled p:2 and m:1.]
Partition Patterns and Databases
• Frequent patterns can be partitioned into subsets according to the f-list
• F-list = f-c-a-b-m-p
  • Patterns containing p: p’s conditional pattern base
  • Patterns having m but no p: m’s conditional pattern base
  • …
  • Patterns having c but none of a, b, m, p: c’s conditional pattern base
  • Pattern f
• This partitioning is complete and non-redundant
Find Patterns Having P From P-conditional Database
[Figure: m’s conditional FP-tree ({} → f:3 → c:3 → a:3); the conditional pattern base of “cm” is (f:3), giving the cm-conditional FP-tree ({} → f:3); similarly, the am-conditional FP-tree is {} → f:3 → c:3.]
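A sketch of extracting an item's conditional pattern base from the build_fp_tree structures above: follow the item's node-links, then walk each node's parents to collect the prefix path with the node's count.

    def conditional_pattern_base(header, item):
        base = []
        for node in header[item]:
            path, p = [], node.parent
            while p is not None and p.item is not None:
                path.append(p.item)
                p = p.parent
            if path:
                base.append((path[::-1], node.count))  # prefix path + count
        return base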
Benefits of the FP-tree Structure
• Completeness
  • Preserves complete information for frequent pattern mining
  • Never breaks a long pattern of any transaction
• Compactness
  • Reduces irrelevant information: infrequent items are gone
  • Items appear in frequency-descending order: the more frequently an item occurs, the more likely its nodes are shared
  • The tree is never larger than the original database (not counting node-links and count fields)
Scaling FP-growth by Database Projection
• What if the FP-tree cannot fit in memory? Partition the database into a set of projected databases, then construct and mine an FP-tree for each projected database
Partition-Based Projection
[Figure: the transaction DB (fcamp, fcabm, fb, cbp, fcamp) is projected into p-, m-, b-, a-, and c-projected DBs; deeper projections follow, e.g. am-proj DB = {fc, fc, fc} and cm-proj DB = {f, f, f}, …]
Performance of FPGrowth in Large Datasets
[Figure: run time (sec., 0–100) vs. support threshold (%, 0–3) on dataset D1, comparing D1 FP-growth runtime against D1 Apriori runtime.]
Advantages of the Pattern Growth Approach
• Divide-and-conquer:
• Decompose both the mining task and DB according to the frequent
patterns obtained so far
• Lead to focused search of smaller databases
• Other factors
• No candidate generation, no candidate test
• Compressed database: FP-tree structure
• No repeated scan of entire database
• Basic ops: counting local freq items and building sub FP-tree, no pattern
search and matching
Further Improvements of Mining Methods
Extension of Pattern Growth Mining Methodology
ECLAT: Mining by Exploring Vertical Data Format
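ECLAT keeps the data in vertical format (item → set of transaction ids) and obtains an itemset's support by intersecting tid-lists; a minimal sketch, reusing the TDB from the Apriori example:

    def vertical_format(transactions):
        tidlists = {}
        for tid, t in enumerate(transactions):
            for item in t:
                tidlists.setdefault(item, set()).add(tid)
        return tidlists

    tdb = [{'A','C','D'}, {'B','C','E'}, {'A','B','C','E'}, {'B','E'}]
    tl = vertical_format(tdb)
    # support({B, E}) = |tidlist(B) ∩ tidlist(E)|
    print(len(tl['B'] & tl['E']))  # 3, matching {B,E}:3 in L2 above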
CLOSET+: Mining Closed Itemsets by Pattern-Growth
• Pruning
  • Itemset merging: if Y appears in every occurrence of X, then Y is merged with X
  • Sub-itemset pruning: if Y ⊃ X and sup(X) = sup(Y), then X and all of X’s descendants in the set-enumeration tree can be pruned
• Checking
  • Superset checking
  • Subset checking
Mining Frequent Patterns, Associations and Correlations: Basic Concepts and Methods
• Basic Concepts
• Methods
• Summary
Interestingness Measure: Correlations (Lift)
• play basketball ⇒ eat cereal [40%, 66.7%] is misleading: the overall share of students eating cereal is 75%, higher than 66.7%
• play basketball ⇒ not eat cereal [20%, 33.3%] is more accurate, although with lower support and confidence
• Measure of dependent/correlated events: lift(A, B) = P(A ∪ B) / (P(A) P(B)); lift > 1 indicates positive correlation, lift < 1 negative correlation, lift = 1 independence
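A quick check of the two lift values, taking 5000 students so the quoted [support, confidence] percentages come out to whole counts (an illustrative total, consistent with the figures above):

    n = 5000
    basketball, cereal, both = 3000, 3750, 2000
    lift_cereal = (both / n) / ((basketball / n) * (cereal / n))
    lift_no_cereal = ((basketball - both) / n) / ((basketball / n) * ((n - cereal) / n))
    print(round(lift_cereal, 2), round(lift_no_cereal, 2))  # 0.89 < 1 vs 1.33 > 1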
• “Buy walnuts ⇒ buy milk [1%, 80%]” is misleading if 85% of customers buy milk
• Over 20 interestingness measures have been proposed (see Tan, Kumar & Srivastava @KDD’02)
Null-Invariant Measures
Mining Frequent Patterns, Associations and Correlations: Basic Concepts and Methods
• Basic Concepts
• Methods
• Summary
Summary
• Basic concepts: association rules, support-confidence framework, closed and max-patterns
• Scalable frequent pattern mining methods
• Apriori (Candidate generation & test)
• Projection-based (FPgrowth, CLOSET+, ...)
• Vertical format approach (ECLAT, ...)
Assignment 1 & Midterm
• A1 is posted in the dropbox
• Due on Feb 2nd at 5pm
• Submit to the dropbox
• Only .ipynb files are accepted