Data Mining - Dr. Mahmoud Mounir Mahmoud ([email protected])
LECTURE 3
Dr. Mahmoud Mounir
[email protected]
Scalable Frequent Itemset Mining Methods
Further Improvement of the Apriori Method
Sampling for Frequent Patterns
Pattern-Growth Approach: Mining Frequent Patterns Without Candidate Generation
Bottlenecks of the Apriori approach
- Breadth-first (i.e., level-wise) search
- Candidate generation and test: often generates a huge number of candidates
The FPGrowth approach (J. Han, J. Pei, and Y. Yin, SIGMOD'00)
- Depth-first search
- Avoids explicit candidate generation
Major philosophy: grow long patterns from short ones, using only locally frequent items
- If "abc" is a frequent pattern, get all transactions having "abc", i.e., project the DB on abc: DB|abc
- If "d" is a local frequent item in DB|abc, then "abcd" is a frequent pattern
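The projection idea above can be sketched in a few lines (a minimal sketch; the function names and the toy database are illustrative, not from the lecture):

```python
# Grow long patterns from short ones: project the database on a frequent
# pattern, then look for locally frequent items in the projection.

def project(db, pattern):
    """Keep only transactions containing every item of `pattern`,
    with the pattern's own items removed (DB|pattern)."""
    p = set(pattern)
    return [sorted(set(t) - p) for t in db if p <= set(t)]

def local_frequent(db, min_support):
    """Items whose support in `db` meets min_support."""
    counts = {}
    for t in db:
        for item in t:
            counts[item] = counts.get(item, 0) + 1
    return {i: c for i, c in counts.items() if c >= min_support}

db = [list("abcd"), list("abcd"), list("abce"), list("bce")]
proj = project(db, "abc")           # DB|abc: transactions containing a, b, c
print(local_frequent(proj, 2))      # {'d': 2} -> "abcd" is a frequent pattern
```

Here "d" is locally frequent in DB|abc, so "abcd" is frequent in the full database.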
Construct FP-tree from a Transaction Database
Find Patterns Having P From P-conditional Database
[Figure: FP-tree for the classic (f, c, a, b, m, p) example, with its header table (item, frequency, head of node-link) and the conditional pattern bases obtained by following each item's node-links:

item   conditional pattern base
c      f:3
a      fc:3
b      fca:1, f:1, c:1
m      fca:2, fcab:1

The m-, am-, and cm-conditional FP-trees are built from these bases; e.g., the conditional pattern base of "cm" is (f:3).]
(1) First scan of the transaction database
Find the support count of each item, and keep the set L of frequent item patterns that contains only items achieving minimum support.

Tid    Itemset
T100   {M, O, N, K, E, Y}
T200   {D, O, N, K, E, Y}
T300   {M, A, K, E}
T400   {M, U, C, K, Y}
T500   {C, O, O, K, I, E}

Min_Support = 3, Min_Confidence = 80%

Item   Support Count
M      3
O      3
N      2
K      5
E      4
Y      3
D      1
A      1
U      1
C      2
I      1

F-List (frequent items in descending support order):
K      5
E      4
M      3
O      3
Y      3
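The first scan can be sketched as follows (a minimal sketch; variable names are illustrative):

```python
# First database scan: count each item's support and keep only items
# meeting min_support = 3, sorted by descending count -> the F-list.
from collections import Counter

transactions = [
    ['M', 'O', 'N', 'K', 'E', 'Y'],   # T100
    ['D', 'O', 'N', 'K', 'E', 'Y'],   # T200
    ['M', 'A', 'K', 'E'],             # T300
    ['M', 'U', 'C', 'K', 'Y'],        # T400
    ['C', 'O', 'O', 'K', 'I', 'E'],   # T500
]

min_support = 3
# Count each item once per transaction (T500 lists O twice).
counts = Counter(item for t in transactions for item in set(t))
f_list = sorted((i for i in counts if counts[i] >= min_support),
                key=lambda i: (-counts[i], i))
print(f_list)   # ['K', 'E', 'M', 'O', 'Y']
```

Ties among M, O, Y (support 3 each) are broken alphabetically here, which reproduces the lecture's F-list order.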
(2) Construct the FP-tree
Order the itemset in each transaction based on the items' priority in list L, then build the FP-tree by inserting the ordered transactions one by one.

Tid    Itemset               Ordered itemset
T100   {M, O, N, K, E, Y}    {K, E, M, O, Y}
T200   {D, O, N, K, E, Y}    {K, E, O, Y}
T300   {M, A, K, E}          {K, E, M}
T400   {M, U, C, K, Y}       {K, M, Y}
T500   {C, O, O, K, I, E}    {K, E, O}

After inserting T100, the tree is the single path Null -> K:1 -> E:1 -> M:1 -> O:1 -> Y:1; the header table (Item, Support Count, Node Link) holds a node-link to each item's nodes.
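The reordering step can be sketched as (a minimal sketch; `ordered` is an illustrative name):

```python
# Reorder each transaction by F-list priority before tree insertion:
# drop infrequent items, deduplicate, sort by priority in list L.
f_list = ['K', 'E', 'M', 'O', 'Y']          # from the first scan
rank = {item: i for i, item in enumerate(f_list)}

def ordered(transaction):
    """Keep only F-list items, deduplicated, in F-list order."""
    return sorted({i for i in transaction if i in rank}, key=rank.get)

print(ordered(['M', 'O', 'N', 'K', 'E', 'Y']))  # ['K', 'E', 'M', 'O', 'Y']
print(ordered(['C', 'O', 'O', 'K', 'I', 'E']))  # ['K', 'E', 'O']
```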
After inserting T200 ({K, E, O, Y}), the shared prefix is incremented and a new branch is added: K:2 -> E:2, with children M:1 -> O:1 -> Y:1 and O:1 -> Y:1.
After inserting T300 ({K, E, M}), the shared path becomes K:3 -> E:3 -> M:2.
After inserting T400 ({K, M, Y}), K becomes K:4 and a new branch M:1 -> Y:1 is added directly under K.
After inserting T500 ({K, E, O}), the FP-tree is complete: K:5 -> E:4, with branches E -> M:2 -> O:1 -> Y:1, E -> O:2 -> Y:1, and K -> M:1 -> Y:1. The header table (K:5, E:4, M:3, O:3, Y:3) keeps a node-link to every node of each item.
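The tree construction above can be sketched as follows (a minimal sketch; the `Node` class and field names are illustrative):

```python
# Insert each ordered transaction into a prefix tree, incrementing
# counts along shared prefixes and branching where paths diverge.

class Node:
    def __init__(self, item, parent):
        self.item, self.count, self.parent = item, 1, parent
        self.children = {}

def build_fp_tree(ordered_transactions):
    root = Node(None, None)
    for t in ordered_transactions:
        node = root
        for item in t:
            if item in node.children:
                node.children[item].count += 1   # shared prefix
            else:
                node.children[item] = Node(item, node)  # new branch
            node = node.children[item]
    return root

tree = build_fp_tree([
    ['K', 'E', 'M', 'O', 'Y'],   # T100
    ['K', 'E', 'O', 'Y'],        # T200
    ['K', 'E', 'M'],             # T300
    ['K', 'M', 'Y'],             # T400
    ['K', 'E', 'O'],             # T500
])
k = tree.children['K']
print(k.count, k.children['E'].count)   # 5 4, matching the final tree
```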
(3) Generate frequent patterns from the FP-tree
For each item, collect its conditional pattern base by following the item's node-links and reading off the prefix paths (with the item's counts), then generate the frequent patterns:

Items   Conditional Pattern Base                         Generated Frequent Patterns
Y       {K, E, M, O : 1}, {K, E, O : 1}, {K, M : 1}      <K, Y : 3>
O       {K, E, M : 1}, {K, E : 2}                        <K, O : 3>, <E, O : 3>, <K, E, O : 3>
M       {K, E : 2}, {K : 1}                              <K, M : 3>
E       {K : 4}                                          <K, E : 4>
K       -                                                -
Benefits of the FP-tree Structure
Completeness
- Preserves complete information for frequent pattern mining
- Never breaks a long pattern of any transaction
Compactness
- Reduces irrelevant info: infrequent items are gone
- Items are in frequency-descending order: the more frequently an item occurs, the more likely it is to be shared
- The tree is never larger than the original database (not counting node-links and the count fields)
The Frequent Pattern Growth Mining Method
Idea: frequent pattern growth
- Recursively grow frequent patterns by pattern and database partition
Method
- For each frequent item, construct its conditional pattern base, and then its conditional FP-tree
- Repeat the process on each newly created conditional FP-tree
- Until the resulting FP-tree is empty, or it contains only a single path (a single path generates all the combinations of its sub-paths, each of which is a frequent pattern)
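The whole method can be sketched compactly without an explicit tree, by recursing on conditional (projected) databases (a minimal sketch under that simplification; all names are illustrative):

```python
# Pattern-growth mining: for each locally frequent item, record the
# pattern, build its conditional pattern base (prefix items only, so
# each itemset is enumerated exactly once), and recurse.
from collections import Counter

def fp_growth(transactions, min_support, suffix=frozenset()):
    """`transactions` is a list of (items, count) pairs."""
    counts = Counter()
    for items, cnt in transactions:
        for i in set(items):
            counts[i] += cnt
    # Local F-list: descending support, ties broken alphabetically.
    f_list = sorted((i for i in counts if counts[i] >= min_support),
                    key=lambda i: (-counts[i], i))
    rank = {i: r for r, i in enumerate(f_list)}
    patterns = {}
    for item in f_list:
        patterns[suffix | {item}] = counts[item]
        # Conditional pattern base: strictly higher-priority items
        # from every transaction containing `item`.
        cond = [([i for i in set(items) if i in rank and rank[i] < rank[item]],
                 cnt)
                for items, cnt in transactions if item in items]
        patterns.update(fp_growth(cond, min_support, suffix | {item}))
    return patterns

db = [(t, 1) for t in (['M', 'O', 'N', 'K', 'E', 'Y'],
                       ['D', 'O', 'N', 'K', 'E', 'Y'],
                       ['M', 'A', 'K', 'E'],
                       ['M', 'U', 'C', 'K', 'Y'],
                       ['C', 'O', 'O', 'K', 'I', 'E'])]
res = fp_growth(db, 3)
print(res[frozenset({'K', 'E', 'O'})])   # 3
```

On the lecture's example this yields exactly the patterns in the table above: {K}:5, {E}:4, {M}:3, {O}:3, {Y}:3, {K,E}:4, {K,M}:3, {K,O}:3, {E,O}:3, {K,E,O}:3, and {K,Y}:3.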
Performance of FPGrowth in Large Datasets
[Figure: two plots of runtime (sec.) vs. support threshold (%). Left: on data set T25I20D10K (D1), FP-growth's runtime stays low while Apriori's grows sharply as the support threshold drops below about 1%. Right: on data set T25I20D100K (D2), FP-growth similarly outperforms TreeProjection.]
Advantages of the Pattern Growth Approach
Divide-and-conquer
- Decompose both the mining task and the DB according to the frequent patterns obtained so far
- Leads to focused search of smaller databases
Other factors
- No candidate generation, no candidate test
- Compressed database: the FP-tree structure
- No repeated scan of the entire database
- Basic operations: counting local frequent items and building sub FP-trees; no pattern search and matching
A good open-source implementation and refinement of FPGrowth: FPGrowth+ (Grahne and J. Zhu, FIMI'03)