Apriori
Databases
■ Association rule mining
■ Algorithms for scalable mining of (single-dimensional
Boolean) association rules in transactional databases
■ Mining various kinds of association/correlation rules
■ Constraint-based association mining
■ Sequential pattern mining
■ Applications/extensions of frequent pattern mining
■ Summary
C3 (candidate itemsets): {B, C, E}
3rd scan
L3 (frequent itemsets, with sup): {B, C, E}, sup 2
October 14, 2023 Data Mining: Concepts and Techniques 13
The Apriori Algorithm
■ Pseudo-code:
    Ck: candidate itemsets of size k
    Lk: frequent itemsets of size k

    L1 = {frequent items};
    for (k = 1; Lk != ∅; k++) do begin
        Ck+1 = candidates generated from Lk;
        for each transaction t in database do
            increment the count of all candidates in Ck+1 that are contained in t
        Lk+1 = candidates in Ck+1 with min_support
    end
    return ∪k Lk;
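The pseudo-code above can be sketched in Python roughly as follows. This is a minimal illustration, not a tuned implementation: the function name `apriori`, the use of an absolute support count for `min_support`, and the four-transaction toy database are assumptions made for the example. The candidate step joins pairs of frequent k-itemsets and keeps a (k+1)-itemset only if all of its k-subsets are frequent, which is one common way to realize "candidates generated from Lk".

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frozenset(itemset): support count} for all frequent itemsets.

    min_support is an absolute count (an assumption of this sketch).
    """
    transactions = [frozenset(t) for t in transactions]

    # L1: count single items and keep the frequent ones
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    level = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(level)

    k = 1
    while level:
        # C(k+1): join frequent k-itemsets, prune if any k-subset is infrequent
        candidates = set()
        for a in level:
            for b in level:
                union = a | b
                if len(union) == k + 1 and all(
                    frozenset(sub) in level for sub in combinations(union, k)
                ):
                    candidates.add(union)

        # One database scan: count every candidate contained in each transaction
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1

        # L(k+1): candidates that reach min_support
        level = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(level)
        k += 1
    return frequent

# Toy database (hypothetical), min_support = 2 transactions
db = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
result = apriori(db, 2)
print(result[frozenset("BCE")])  # 2
```

On this toy database the third pass leaves the single frequent 3-itemset {B, C, E} with support 2. Each iteration of the `while` loop corresponds to one full scan of the database, which is the cost the later slides try to reduce.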
■ Challenges
■ Multiple scans of transaction database
■ Huge number of candidates
■ Tedious workload of support counting for
candidates
■ Improving Apriori: general ideas
■ Reduce passes of transaction database scans
■ Shrink number of candidates
■ Facilitate support counting of candidates
DIC: Reduce Number of Scans
■ Once both A and D are determined frequent, the counting of AD begins
■ Once all length-2 subsets of BCD are determined frequent, the counting of BCD begins
[Figure: itemset lattice ({}; A, B, C, D; AB, AC, BC, AD, BD, CD; ABC, ABD, ACD, BCD; ABCD), with a timeline over the transactions comparing when Apriori and DIC start counting 1-itemsets, 2-itemsets, and 3-itemsets]
S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In SIGMOD’97
Partition: Scan Database Only Twice
■ Scan 1: partition database and find local frequent patterns
■ Scan 2: consolidate global frequent patterns
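The two-scan idea can be illustrated with a small Python sketch. The names `local_frequent` and `partition_mine`, the relative threshold `min_support_ratio`, and the brute-force local miner are all assumptions of this example; in practice any in-memory algorithm could stand in for the local step. The sketch relies on the property that an itemset frequent in the whole database must be frequent in at least one partition, so the union of local results is a complete candidate set for the second scan.

```python
from itertools import combinations
from math import ceil

def local_frequent(partition, min_count):
    """Brute-force miner for one partition (a stand-in for any in-memory method)."""
    counts = {}
    for t in partition:
        items = sorted(t)
        for k in range(1, len(items) + 1):
            for combo in combinations(items, k):
                key = frozenset(combo)
                counts[key] = counts.get(key, 0) + 1
    return {s for s, c in counts.items() if c >= min_count}

def partition_mine(transactions, min_support_ratio, n_parts):
    """Return {frozenset: global support count} using two passes over the data."""
    transactions = [frozenset(t) for t in transactions]
    size = ceil(len(transactions) / n_parts)
    parts = [transactions[i:i + size] for i in range(0, len(transactions), size)]

    # Scan 1: mine each partition with a proportionally scaled threshold
    candidates = set()
    for part in parts:
        local_min = ceil(min_support_ratio * len(part))
        candidates |= local_frequent(part, local_min)

    # Scan 2: count the candidates globally and keep the truly frequent ones
    global_min = ceil(min_support_ratio * len(transactions))
    counts = {c: 0 for c in candidates}
    for t in transactions:
        for c in candidates:
            if c <= t:
                counts[c] += 1
    return {s: n for s, n in counts.items() if n >= global_min}
```

The brute-force local miner is exponential in transaction length and is used here only to keep the example self-contained.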
■ Frequent 1-itemset: a, b, d, e
FP-tree
■ Until the resulting FP-tree is empty, or it contains only one
■ Completeness
■ Preserve complete information for frequent pattern
mining
■ Never break a long pattern of any transaction
■ Compactness
■ Reduce irrelevant info—infrequent items are gone
■ Patterns containing p
■ …
■ Pattern f
Header Table
Item  frequency  head
f     4
c     4
a     3
b     3
m     3
p     3

FP-tree
{}
  f:4
    c:3
      a:3
        m:2
          p:2
        b:1
          m:1
    b:1
  c:1
    b:1
      p:1

Conditional pattern bases
item  cond. pattern base
c     f:3
a     fc:3
b     fca:1, f:1, c:1
m     fca:2, fcab:1
p     fcam:2, cb:1
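As a rough illustration of how the conditional pattern bases are read off the tree, the sketch below builds an FP-tree from frequency-ordered transactions and collects, for a given item, the prefix path of every node carrying that item. The class `Node` and the functions `build_fp_tree` and `conditional_pattern_base` are hypothetical names chosen for this example, and items are assumed to be single characters so that a path can be written as a string. The input is the list of ordered transactions fcamp, fcabm, fb, cbp, fcamp that this deck uses in its projection example.

```python
class Node:
    """One FP-tree node: an item, a count, and links to parent and children."""
    def __init__(self, item, parent):
        self.item = item
        self.parent = parent
        self.count = 0
        self.children = {}

def build_fp_tree(ordered_transactions):
    """Insert each frequency-ordered transaction as a path from the root."""
    root = Node(None, None)
    header = {}  # item -> list of tree nodes carrying that item
    for t in ordered_transactions:
        node = root
        for item in t:
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
                header.setdefault(item, []).append(child)
            child.count += 1
            node = child
    return root, header

def conditional_pattern_base(header, item):
    """Return {prefix path: count} accumulated over all nodes of the item."""
    base = {}
    for node in header.get(item, []):
        path = []
        parent = node.parent
        while parent is not None and parent.item is not None:
            path.append(parent.item)
            parent = parent.parent
        if path:  # nodes directly under the root contribute no prefix
            prefix = "".join(reversed(path))
            base[prefix] = base.get(prefix, 0) + node.count
    return base

root, header = build_fp_tree(["fcamp", "fcabm", "fb", "cbp", "fcamp"])
print(conditional_pattern_base(header, "p"))  # {'fcam': 2, 'cb': 1}
print(conditional_pattern_base(header, "m"))  # {'fca': 2, 'fcab': 1}
```

Each prefix path inherits the count of the node it leads to, not the counts of the nodes along the path, which is why p yields fcam:2 even though f appears with count 4 in the tree.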
From Conditional Pattern-bases to Conditional FP-trees
m-conditional FP-tree: {} → f:3 → c:3 → a:3
Cond. pattern base of “am”: (fc:3)
am-conditional FP-tree: {} → f:3 → c:3
Cond. pattern base of “cm”: (f:3)
cm-conditional FP-tree: {} → f:3
Cond. pattern base of “cam”: (f:3)
cam-conditional FP-tree: {} → f:3
[Figure: FP-tree fragment with the path {} → a1:n1 → a2:n2 → a3:n3 and node r1]
■ Parallel projection needs a lot of disk space
■ Partition projection saves it

Tran. DB
fcamp
fcabm
fb
cbp
fcamp

am-proj DB   cm-proj DB   …
fc           f
fc           f
fc           f
FP-Growth vs. Apriori: Scalability With the Support Threshold
■ Divide-and-conquer:
■ decompose both the mining task and DB according to
the frequent patterns obtained so far
■ leads to focused search of smaller databases
■ Other factors
■ no candidate generation, no candidate test
■ compressed database: FP-tree structure
■ no repeated scan of entire database
■ basic ops—counting local freq items and building sub
FP-tree, no pattern search and matching
ECLAT: Mining by Exploring Vertical Data Format
■ Vertical format: t(AB) = {T11, T25, …}
■ tid-list: list of trans.-ids containing an itemset
■ Deriving frequent patterns based on vertical intersections
■ t(X) = t(Y): X and Y always happen together
■ t(X) ⊂ t(Y): transaction having X always has Y
■ Using diffset to accelerate mining
■ Only keep track of differences of tids
■ t(X) = {T1, T2, T3}, t(XY) = {T1, T3}
■ Diffset (XY, X) = {T2}
■ Eclat (Zaki et al. @KDD’97)
■ Mining Closed patterns using vertical format: CHARM (Zaki &
Hsiao@SDM’02)
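The vertical-format ideas above can be sketched in a few lines of Python. This is a simplified depth-first illustration in the spirit of Eclat, not the published algorithm: the function names `vertical_format`, `eclat`, and `diffset`, the absolute `min_support` count, and the four-transaction toy database are assumptions of the example. The support of an itemset is the size of its tid-list, and extending an itemset means intersecting tid-lists.

```python
def vertical_format(transactions):
    """Turn {tid: items} into {item: set of tids containing it}."""
    tidlists = {}
    for tid, items in transactions.items():
        for item in items:
            tidlists.setdefault(item, set()).add(tid)
    return tidlists

def eclat(prefix, items, min_support, out):
    """Depth-first search; items is a list of (item, tid-list of prefix + item)."""
    for i, (item, tids) in enumerate(items):
        if len(tids) < min_support:
            continue
        itemset = prefix + (item,)
        out[itemset] = len(tids)  # support = size of the tid-list
        # Extend with each later item by intersecting tid-lists
        suffix = [(other, tids & other_tids) for other, other_tids in items[i + 1:]]
        eclat(itemset, suffix, min_support, out)

def diffset(t_x, t_xy):
    """Tids that contain X but not XY; sup(XY) = sup(X) - len(diffset)."""
    return t_x - t_xy

# Toy database (hypothetical), keyed by transaction id
db = {10: "ACD", 20: "BCE", 30: "ABCE", 40: "BE"}
frequent = {}
eclat((), sorted(vertical_format(db).items()), 2, frequent)
print(frequent[("B", "C", "E")])  # 2

# The diffset example from the slide
print(diffset({"T1", "T2", "T3"}, {"T1", "T3"}))  # {'T2'}
```

Storing the diffset instead of the full tid-list pays off when t(XY) is nearly as large as t(X), since only the few missing tids need to be kept.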