CSE 385 - Data Mining and Business Intelligence - Lecture 03 - Part 01
BUSINESS INTELLIGENCE -
LECTURE 03
Dr. Mahmoud Mounir
Scalable Frequent Itemset Mining Methods
Further Improvement of the Apriori Method
Sampling for Frequent Patterns
Pattern-Growth Approach: Mining Frequent
Patterns Without Candidate Generation
◼ Bottlenecks of the Apriori approach
◼ Breadth-first (i.e., level-wise) search
◼ Candidate generation and test
◼ Often generates a huge number of candidates
◼ The FPGrowth Approach (J. Han, J. Pei, and Y. Yin, SIGMOD'00)
◼ Depth-first search
◼ Avoid explicit candidate generation
◼ Major philosophy: Grow long patterns from short ones using local
frequent items only
◼ “abc” is a frequent pattern
◼ Get all transactions having “abc”, i.e., project DB on abc: DB|abc
◼ “d” is a local frequent item in DB|abc → abcd is a frequent pattern
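The "grow long patterns from short ones" idea above can be sketched in a few lines of Python. This is only an illustration: the function names `project` and `local_frequent_items`, the toy database, and the threshold of 3 are assumptions made here, not part of the lecture.

```python
from collections import Counter

def project(db, pattern):
    """Return DB|pattern: the transactions that contain every item of `pattern`."""
    wanted = set(pattern)
    return [t for t in db if wanted <= set(t)]

def local_frequent_items(db, pattern, min_support):
    """Items outside `pattern` that are frequent inside DB|pattern."""
    counts = Counter()
    for t in project(db, pattern):
        counts.update(set(t) - set(pattern))
    return {item: n for item, n in counts.items() if n >= min_support}

# Hypothetical toy database, chosen only to mirror the "abc" / "d" example.
toy_db = ["abcd", "abcde", "abc", "abd", "abcd", "ce"]

# "d" is locally frequent in DB|abc, so "abcd" is a frequent pattern.
grown = local_frequent_items(toy_db, "abc", min_support=3)
print(grown)
```

Each locally frequent item extends the current pattern by one item, and the same step can then be repeated on the smaller projected database.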
Construct FP-tree from a Transaction Database
◼ Patterns containing p
◼ …
◼ Pattern f
Find Patterns Having P From P-conditional Database
Header Table
Item   Frequency
f      4
c      4
a      3
b      3
m      3
p      3

[Figure: the FP-tree, rooted at {}, with node-links from the header table.]

Conditional pattern bases
Item   Cond. pattern base
c      f:3
a      fc:3
b      fca:1, f:1, c:1
m      fca:2, fcab:1

m-conditional pattern base: fca:2, fcab:1
m-conditional FP-tree: {} → f:3 → c:3 → a:3
am-conditional FP-tree: {} → f:3 → c:3
Cond. pattern base of “cm”: (f:3)
cm-conditional FP-tree: {} → f:3
(1) First scan of the transaction database

Find the support count of each item, then keep the set (L) of frequent items, i.e. only the items that achieve the minimum support.

Min_Support = 3
Min_Confidence = 80%

Tid    Itemset
T100   {M, O, N, K, E, Y}
T200   {D, O, N, K, E, Y}
T300   {M, A, K, E}
T400   {M, U, C, K, Y}
T500   {C, O, O, K, I, E}

Items  Support Count
M      3
O      3
N      2
K      5
E      4
Y      3
D      1
A      1
U      1
C      2
I      1

Frequent items (L), in descending order of support count:

Items  Support Count
K      5
E      4
M      3
O      3
Y      3
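The first scan can be sketched as follows. This is a minimal illustration, assuming that an item is counted once per transaction (so the repeated O in T500 counts once) and that ties in support are broken alphabetically. The name `first_scan` is chosen here for illustration only.

```python
from collections import Counter

transactions = {
    "T100": ["M", "O", "N", "K", "E", "Y"],
    "T200": ["D", "O", "N", "K", "E", "Y"],
    "T300": ["M", "A", "K", "E"],
    "T400": ["M", "U", "C", "K", "Y"],
    "T500": ["C", "O", "O", "K", "I", "E"],
}
MIN_SUPPORT = 3

def first_scan(transactions, min_support):
    """Count item supports and return (all counts, frequent list L)."""
    counts = Counter()
    for items in transactions.values():
        counts.update(set(items))  # each item counts once per transaction
    frequent = [(item, n) for item, n in counts.items() if n >= min_support]
    # Descending support; alphabetical tie-break is an assumption of this sketch.
    frequent.sort(key=lambda pair: (-pair[1], pair[0]))
    return counts, frequent

counts, L = first_scan(transactions, MIN_SUPPORT)
for item, n in L:
    print(item, n)
```

With these assumptions the list L comes out as K, E, M, O, Y, the order used in the tables above.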
(2) Construct the FP tree

Order the itemset in each transaction based on the items' priority in list L, then build the FP tree by inserting the ordered itemsets one at a time.

Tid    Itemset              Ordered itemset
T100   {M, O, N, K, E, Y}   {K, E, M, O, Y}
T200   {D, O, N, K, E, Y}   {K, E, O, Y}
T300   {M, A, K, E}         {K, E, M}
T400   {M, U, C, K, Y}      {K, M, Y}
T500   {C, O, O, K, I, E}   {K, E, O}

Header table (each entry keeps a node link into the tree)
Items  Support Count
K      5
E      4
M      3
O      3
Y      3

Tree after inserting T100, {K, E, M, O, Y}:
Null
  K: 1
    E: 1
      M: 1
        O: 1
          Y: 1
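The ordering step can be sketched like this. The helper name `order_itemset` is hypothetical; the sketch drops infrequent items, removes duplicates within a transaction, and sorts the rest by their position in L.

```python
# Frequent list L from the first scan, highest support first.
L = ["K", "E", "M", "O", "Y"]
rank = {item: position for position, item in enumerate(L)}

def order_itemset(items, rank):
    """Keep only frequent items (once each) and sort them by priority in L."""
    kept = {item for item in items if item in rank}
    return sorted(kept, key=lambda item: rank[item])

raw = {
    "T100": ["M", "O", "N", "K", "E", "Y"],
    "T200": ["D", "O", "N", "K", "E", "Y"],
    "T300": ["M", "A", "K", "E"],
    "T400": ["M", "U", "C", "K", "Y"],
    "T500": ["C", "O", "O", "K", "I", "E"],
}
ordered = {tid: order_itemset(items, rank) for tid, items in raw.items()}
for tid, items in ordered.items():
    print(tid, items)
```

Sorting every transaction by the same global order is what lets transactions with common frequent items share a prefix in the tree.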
(2) Construct the FP tree (continued)

Tree after inserting T200, {K, E, O, Y}:
Null
  K: 2
    E: 2
      M: 1
        O: 1
          Y: 1
      O: 1
        Y: 1
(2) Construct the FP tree (continued)

Tree after inserting T300, {K, E, M}:
Null
  K: 3
    E: 3
      M: 2
        O: 1
          Y: 1
      O: 1
        Y: 1
(2) Construct the FP tree (continued)

Tree after inserting T400, {K, M, Y}:
Null
  K: 4
    E: 3
      M: 2
        O: 1
          Y: 1
      O: 1
        Y: 1
    M: 1
      Y: 1
(2) Construct the FP tree (continued)

Tree after inserting T500, {K, E, O}:
Null
  K: 5
    E: 4
      M: 2
        O: 1
          Y: 1
      O: 2
        Y: 1
    M: 1
      Y: 1
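The insertion procedure walked through on these slides can be sketched as a small prefix tree with counts. This is an illustrative sketch, not a reference implementation: the class `FPNode`, the functions `build_fp_tree` and `show`, and the use of a plain list per item in place of chained node links are all choices made here.

```python
class FPNode:
    """One tree node: an item, its count, its parent, and its children by item."""
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}

def build_fp_tree(ordered_transactions):
    """Insert each ordered itemset along a shared prefix path from the root."""
    root = FPNode(None)
    header = {}  # item -> list of its nodes (stands in for the node links)
    for itemset in ordered_transactions:
        node = root
        for item in itemset:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, parent=node)
                node.children[item] = child
                header.setdefault(item, []).append(child)
            child.count += 1  # one more transaction passes through this node
            node = child
    return root, header

def show(node, depth=0):
    """Return the tree as indented 'item: count' lines."""
    lines = []
    for child in node.children.values():
        lines.append("  " * depth + f"{child.item}: {child.count}")
        lines.extend(show(child, depth + 1))
    return lines

ordered_transactions = [
    ["K", "E", "M", "O", "Y"],  # T100
    ["K", "E", "O", "Y"],       # T200
    ["K", "E", "M"],            # T300
    ["K", "M", "Y"],            # T400
    ["K", "E", "O"],            # T500
]
root, header = build_fp_tree(ordered_transactions)
print("\n".join(show(root)))
```

The counts stored per item in `header` add up to that item's support count, which is a quick sanity check on any FP-tree.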
Items  Conditional Pattern Base
K      -

Items  Frequent Patterns Generated
Y      <K, Y : 3>
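As a rough sketch, the conditional pattern base of an item can be computed from the ordered itemsets: for every transaction that contains the item, take the prefix that precedes it, and merge equal prefixes by adding their counts. This gives the same prefix paths one would read off the FP tree by following node links. The function names below are hypothetical, and reading the prefixes from the transactions instead of the tree is a simplification made for this example.

```python
from collections import Counter

ordered_transactions = [
    ["K", "E", "M", "O", "Y"],  # T100
    ["K", "E", "O", "Y"],       # T200
    ["K", "E", "M"],            # T300
    ["K", "M", "Y"],            # T400
    ["K", "E", "O"],            # T500
]
MIN_SUPPORT = 3

def conditional_pattern_base(ordered_transactions, item):
    """Map each non-empty prefix path that precedes `item` to its count."""
    base = Counter()
    for itemset in ordered_transactions:
        if item in itemset:
            prefix = tuple(itemset[:itemset.index(item)])
            if prefix:
                base[prefix] += 1
    return dict(base)

def conditional_frequent_items(base, min_support):
    """Items that stay frequent inside a conditional pattern base."""
    counts = Counter()
    for prefix, n in base.items():
        for item in prefix:
            counts[item] += n
    return {item: n for item, n in counts.items() if n >= min_support}

for item in ["Y", "O", "M", "E", "K"]:
    base = conditional_pattern_base(ordered_transactions, item)
    print(item, base, conditional_frequent_items(base, MIN_SUPPORT))
```

For Y, only K remains frequent in the conditional pattern base, which is where the pattern <K, Y : 3> comes from. K has an empty base because it is always the first item on its path.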
Example
Min Support = 2
Confidence = 70%
(1) First scan of the transaction database

Find the support count of each item, then keep the set (L) of frequent items, i.e. only the items that achieve the minimum support.

Min_Support = 2
Min_Confidence = 70%

Tid    Itemset
T100   I1, I2, I5
T200   I2, I4
T300   I2, I3
T400   I1, I2, I4
T500   I1, I3
T600   I2, I3
T700   I1, I3
T800   I1, I2, I3, I5
T900   I1, I2, I3

Items  Support Count
I1     6
I2     7
I3     6
I4     2
I5     2
Sort the list L in descending order of each item's support count:

Items  Support Count
I2     7
I1     6
I3     6
I4     2
I5     2
(2) Construct the FP tree

Header table
Items  Support Count
I2     7
I1     6
I3     6
I4     2
I5     2

Null
  I2: 7
    I1: 4
      I5: 1
      I4: 1
      I3: 2
        I5: 1
    I4: 1
    I3: 2
  I1: 2
    I3: 2
Items  Conditional Pattern Base
I2     -
Benefits of the FP-tree Structure
◼ Completeness
◼ Preserve complete information for frequent pattern
mining
◼ Never break a long pattern of any transaction
◼ Compactness
◼ Reduce irrelevant info—infrequent items are gone
◼ Items in frequency descending order: the more
frequently occurring, the more likely to be shared
◼ Never larger than the original database (not counting the
node-links and the count fields)
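The compactness claim can be checked on a small case. The sketch below, using a hypothetical helper `count_tree_nodes` and the ordered itemsets {K, E, M, O, Y}, {K, E, O, Y}, {K, E, M}, {K, M, Y}, {K, E, O} from the first worked example, compares the number of tree nodes with the number of item occurrences that would have to be stored without prefix sharing.

```python
def count_tree_nodes(ordered_transactions):
    """Count the nodes of the prefix tree built from the ordered itemsets."""
    root = {}
    nodes = 0
    for itemset in ordered_transactions:
        node = root
        for item in itemset:
            if item not in node:
                node[item] = {}  # a new node is created only for an unseen prefix
                nodes += 1
            node = node[item]
    return nodes

ordered_transactions = [
    ["K", "E", "M", "O", "Y"],
    ["K", "E", "O", "Y"],
    ["K", "E", "M"],
    ["K", "M", "Y"],
    ["K", "E", "O"],
]
tree_nodes = count_tree_nodes(ordered_transactions)
item_occurrences = sum(len(t) for t in ordered_transactions)
print(tree_nodes, item_occurrences)
```

Every inserted item either reuses an existing node or creates exactly one new node, so the node count can never exceed the number of item occurrences.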
The Frequent Pattern Growth Mining Method
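◼ Idea: recursively grow frequent patterns by pattern and
database partition
◼ Method
◼ For each frequent item, construct its conditional
FP-tree
◼ Repeat the process on each newly created conditional FP-tree
◼ Until the resulting FP-tree is empty, or it contains only
one path (a single path generates all the combinations of its
sub-paths, each of which is a frequent pattern)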
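The recursion can be sketched compactly under one simplifying assumption, stated plainly: each conditional database is kept as a list of (prefix, count) pairs instead of a compressed FP-tree, and the single-path shortcut is not used. The partitioning logic is the same, but this is not the tree-based implementation from the paper. The function name `pattern_growth` is chosen for this example, and the data is the nine-transaction example with Min_Support = 2.

```python
from collections import Counter

def pattern_growth(weighted_db, min_support, suffix=()):
    """Return {frozenset(pattern): support} for a list of (items, count) pairs."""
    counts = Counter()
    for items, n in weighted_db:
        for item in set(items):
            counts[item] += n
    # Local frequent items, highest support first (alphabetical tie-break assumed).
    order = sorted((i for i in counts if counts[i] >= min_support),
                   key=lambda i: (-counts[i], i))
    rank = {item: r for r, item in enumerate(order)}
    ordered_db = [
        (sorted((i for i in set(items) if i in rank), key=lambda i: rank[i]), n)
        for items, n in weighted_db
    ]
    patterns = {}
    for item in reversed(order):  # start from the least frequent item
        new_suffix = (item,) + suffix
        patterns[frozenset(new_suffix)] = counts[item]
        # Conditional pattern base: the prefixes that precede `item`.
        cond_base = [(t[:t.index(item)], n) for t, n in ordered_db
                     if item in t and t.index(item) > 0]
        patterns.update(pattern_growth(cond_base, min_support, new_suffix))
    return patterns

db = [
    (["I1", "I2", "I5"], 1), (["I2", "I4"], 1), (["I2", "I3"], 1),
    (["I1", "I2", "I4"], 1), (["I1", "I3"], 1), (["I2", "I3"], 1),
    (["I1", "I3"], 1), (["I1", "I2", "I3", "I5"], 1), (["I1", "I2", "I3"], 1),
]
patterns = pattern_growth(db, min_support=2)
for pattern, support in sorted(patterns.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(pattern), support)
```

The recursion ends when a conditional database has no frequent items left, which corresponds to the empty conditional FP-tree in the stopping condition.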
Performance of FPGrowth in Large Datasets
[Figure: two plots of runtime (sec.) versus support threshold (%). Left panel, data set T25I20D10K: D1 FP-growth runtime versus D1 Apriori runtime. Right panel, data set T25I20D100K: D2 FP-growth versus D2 TreeProjection.]
Advantages of the Pattern Growth Approach
◼ Divide-and-conquer:
◼ Decompose both the mining task and DB according to the
frequent patterns obtained so far
◼ Lead to focused search of smaller databases
◼ Other factors
◼ No candidate generation, no candidate test
◼ Compressed database: FP-tree structure
◼ No repeated scan of entire database
◼ Basic ops: counting local freq items and building sub FP-tree, no
pattern search and matching
◼ A good open-source implementation and refinement of FPGrowth
◼ FPGrowth+ (Grahne and J. Zhu, FIMI'03)