KDDM-Lecture 3
— Chapter 6 —
Scalable Frequent Itemset Mining Methods
Basic Concepts: Frequent Patterns
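The Downward Closure Property and Scalable Mining Methods
The downward closure property of frequent patterns
Any subset of a frequent itemset must be frequent
If {beer, diaper, nuts} is frequent, so is {beer, diaper}
i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper}
Scalable mining methods
Frequent pattern growth approach (FPGrowth: Han, Pei & Yin @SIGMOD’00)
Vertical data format approach (Charm: Zaki & Hsiao @SDM’02)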
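The downward closure property can be checked directly on a small database. The sketch below is illustrative only: it uses the five-transaction beer/diaper table from the association-rule slide of this lecture, and the helper names `support_count` and `downward_closure_holds` are assumptions, not part of the lecture.

```python
from itertools import combinations

# Five-transaction example database (items bought per transaction).
transactions = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]


def support_count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)


def downward_closure_holds(itemset, transactions):
    """Check that no non-empty proper subset is rarer than `itemset` itself."""
    items = sorted(itemset)
    full = support_count(items, transactions)
    for size in range(1, len(items)):
        for subset in combinations(items, size):
            if support_count(subset, transactions) < full:
                return False
    return True


print(support_count({"Beer", "Diaper", "Nuts"}, transactions))
print(support_count({"Beer", "Diaper"}, transactions))
print(downward_closure_holds({"Beer", "Diaper", "Nuts"}, transactions))
```

Because every transaction containing an itemset also contains each of its subsets, the check can never fail; the function is only a way to see the property on concrete data.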
Apriori: A Candidate Generation & Test Approach
The Apriori Algorithm—An Example
Supmin = 2

Database TDB
Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

1st scan: C1
Itemset   sup
{A}       2
{B}       3
{C}       3
{D}       1
{E}       3

L1
Itemset   sup
{A}       2
{B}       3
{C}       3
{E}       3

C2 (generated from L1)
Itemset
{A, B}
{A, C}
{A, E}
{B, C}
{B, E}
{C, E}

2nd scan: C2 with counts
Itemset   sup
{A, B}    1
{A, C}    2
{A, E}    1
{B, C}    2
{B, E}    3
{C, E}    2

L2
Itemset   sup
{A, C}    2
{B, C}    2
{B, E}    3
{C, E}    2

C3 (generated from L2)
Itemset
{B, C, E}

3rd scan: L3
Itemset     sup
{B, C, E}   2
The Apriori Algorithm (Pseudo-Code)
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
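The pseudo-code can be turned into a short runnable sketch. The version below is one possible reading, not a reference implementation: the function name `apriori`, the use of `frozenset` keys, and the pairwise-union join with subset pruning are assumptions made for illustration. It is run on the four-transaction TDB example with a minimum support count of 2.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    # First scan: count single items to obtain L1.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)

    k = 1
    while current:
        # Generate C(k+1): join pairs of frequent k-itemsets, then prune
        # any candidate that has an infrequent k-subset (downward closure).
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) == k + 1 and all(
                frozenset(sub) in current for sub in combinations(union, k)
            ):
                candidates.add(union)

        # One database scan: count the candidates contained in each transaction.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1

        # L(k+1): candidates that reach the minimum support.
        current = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(current)
        k += 1
    return frequent


# The TDB example: Tid 10, 20, 30, 40.
tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
result = apriori(tdb, 2)
for itemset, count in sorted(result.items(), key=lambda p: (len(p[0]), sorted(p[0]))):
    print(sorted(itemset), count)
```

The loop stops when no candidate of the next size is frequent, which mirrors the `Lk != ∅` condition in the pseudo-code.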
Implementation of Apriori
Basic Concepts: Association Rules
computer ⇒ antivirus software [support = 2%, confidence = 60%]
2% of all the transactions under analysis show that computer and antivirus software are purchased together
60% of the customers who purchased a computer also bought the software

Tid   Items bought
10    Beer, Nuts, Diaper
20    Beer, Coffee, Diaper
30    Beer, Diaper, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Diaper, Eggs, Milk

[Figure: customer buys beer, customer buys diaper, customer buys both]

Association rules (support, confidence):
Beer ⇒ Diaper (0.6, 1)
Diaper ⇒ Beer (0.6, 0.75)
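The support and confidence values of the two beer/diaper rules can be recomputed from the transaction table. This is a minimal sketch; the helper names `count`, `support`, and `confidence` are assumptions chosen for illustration.

```python
# Transaction table from the slide (Tid 10 to 50).
transactions = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]


def count(itemset, transactions):
    """Number of transactions containing every item in `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= t)


def support(antecedent, consequent, transactions):
    """Fraction of transactions containing both sides of the rule."""
    both = set(antecedent) | set(consequent)
    return count(both, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Fraction of transactions with the antecedent that also hold the consequent."""
    both = set(antecedent) | set(consequent)
    return count(both, transactions) / count(antecedent, transactions)


for lhs, rhs in [({"Beer"}, {"Diaper"}), ({"Diaper"}, {"Beer"})]:
    print(lhs, "=>", rhs,
          support(lhs, rhs, transactions),
          confidence(lhs, rhs, transactions))
```

Both rules share the same support because support depends only on the combined itemset {Beer, Diaper}; the confidence differs because Diaper appears in four transactions while Beer appears in three.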
Further Improvement of the Apriori Method
Partition: Scan Database Only Twice
Any itemset that is potentially frequent in DB must be
frequent in at least one of the partitions of DB
Scan 1: partition database and find local frequent
patterns
Scan 2: consolidate global frequent patterns
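The two scans can be sketched as follows. This is a simplified illustration under stated assumptions: the minimum support is given as a ratio so it can be scaled to each partition, local mining is done by brute-force subset enumeration (exponential in transaction length, acceptable only for toy data), and the names `local_frequent` and `partition_mine` are hypothetical.

```python
from itertools import combinations
from math import ceil


def local_frequent(partition, min_ratio):
    """Itemsets whose count within one partition reaches min_ratio * |partition|."""
    counts = {}
    for t in partition:
        items = sorted(t)
        for size in range(1, len(items) + 1):
            for combo in combinations(items, size):
                key = frozenset(combo)
                counts[key] = counts.get(key, 0) + 1
    threshold = ceil(min_ratio * len(partition))
    return {s for s, c in counts.items() if c >= threshold}


def partition_mine(transactions, min_ratio, num_partitions):
    """Two-scan mining: local frequent itemsets first, global verification second."""
    transactions = [frozenset(t) for t in transactions]
    size = ceil(len(transactions) / num_partitions)
    partitions = [transactions[i:i + size] for i in range(0, len(transactions), size)]

    # Scan 1: the union of local frequent itemsets is the global candidate set.
    candidates = set()
    for p in partitions:
        candidates |= local_frequent(p, min_ratio)

    # Scan 2: count every candidate over the whole database.
    counts = {c: 0 for c in candidates}
    for t in transactions:
        for c in candidates:
            if c <= t:
                counts[c] += 1
    threshold = min_ratio * len(transactions)
    return {s: c for s, c in counts.items() if c >= threshold}


# The TDB example with minimum support 2 out of 4 transactions (ratio 0.5).
tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
result = partition_mine(tdb, 0.5, 2)
print(len(result))
```

No globally frequent itemset is missed: if an itemset fell below the scaled threshold in every partition, its total count would fall below the global threshold as well.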
Pattern-Growth Approach: Mining Frequent Patterns Without Candidate Generation
Bottlenecks of the Apriori approach
Breadth-first (i.e., level-wise) search
Candidate generation and test
Often generates a huge number of candidates
The FPGrowth Approach (J. Han, J. Pei, and Y. Yin, SIGMOD’00)
Depth-first search
Avoid explicit candidate generation
Major philosophy: Grow long patterns from short ones using local
frequent items only
“abc” is a frequent pattern
Get all transactions having “abc”, i.e., project DB on abc: DB|abc
“d” is a local frequent item in DB|abc ⇒ abcd is a frequent pattern
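The idea of growing a pattern by projecting the database can be sketched in a few lines. Note that this is a simplified depth-first recursion over projected transaction lists, not the FP-tree data structure that FPGrowth uses; the function name `pattern_growth` and the alphabetical item order are assumptions made for illustration.

```python
def pattern_growth(transactions, min_support, prefix=()):
    """Return {pattern tuple: support count}, growing patterns from `prefix`."""
    # Count items locally, i.e. within the database projected on `prefix`.
    counts = {}
    for t in transactions:
        for item in t:
            counts[item] = counts.get(item, 0) + 1

    results = {}
    for item in sorted(counts):
        if counts[item] < min_support:
            continue  # not a local frequent item, so it cannot extend the prefix
        pattern = prefix + (item,)
        results[pattern] = counts[item]

        # Project the database on `pattern`: keep transactions containing
        # `item`, and only the items ordered after it, so that each pattern
        # is enumerated exactly once.
        projected = [{x for x in t if x > item} for t in transactions if item in t]
        results.update(pattern_growth(projected, min_support, pattern))
    return results


# The TDB example with minimum support count 2.
tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
patterns = pattern_growth(tdb, 2)
for pattern, count in sorted(patterns.items()):
    print(pattern, count)
```

No candidate set is generated and tested against the full database here: each recursive call only looks at the transactions that already contain the current prefix.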
Construct FP-tree from a Transaction Database
Patterns containing p
…
Pattern f