
DATA MINING - LECTURE 3

Dr. Mahmoud Mounir
mahmoud.mounir@cis.asu.edu.eg

Scalable Frequent Itemset Mining Methods

• Apriori: A Candidate Generation-and-Test Approach
• Improving the Efficiency of Apriori
• FPGrowth: A Frequent Pattern-Growth Approach

Further Improvement of the Apriori Method

• Major computational challenges
  • Multiple scans of the transaction database
  • Huge number of candidates
  • Tedious workload of support counting for candidates
• Improving Apriori: general ideas
  • Reduce the number of transaction database scans
  • Shrink the number of candidates
  • Facilitate support counting of candidates

Sampling for Frequent Patterns

• Select a sample of the original database and mine frequent patterns within the sample using Apriori
• Scan the database once to verify the frequent itemsets found in the sample; only the borders of the closure of the frequent patterns are checked
  • Example: check abcd instead of ab, ac, ...
• Scan the database again to find missed frequent patterns
• H. Toivonen. Sampling large databases for association rules. In VLDB'96

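A minimal sketch of the sample-then-verify idea in Python. All function names here are mine, and the naive itemset counter is for illustration only; Toivonen's actual algorithm additionally checks the negative border of the sample's frequent patterns, which this sketch omits.

    import random
    from collections import Counter
    from itertools import combinations

    def frequent_itemsets(db, min_sup, max_len=3):
        """Naive miner used only for illustration: count every itemset
        of size <= max_len and keep those meeting min_sup."""
        counts = Counter()
        for t in db:
            for k in range(1, max_len + 1):
                counts.update(combinations(sorted(set(t)), k))
        return {iset for iset, c in counts.items() if c >= min_sup}

    def sample_then_verify(db, min_sup, sample_frac=0.1, seed=0):
        random.seed(seed)
        sample = [t for t in db if random.random() < sample_frac]
        # Mine the sample with a proportionally lowered support threshold,
        # so patterns frequent in the full db are unlikely to be missed
        candidates = frequent_itemsets(sample, max(1, int(min_sup * sample_frac)))
        # One scan of the full database verifies the candidates
        support = Counter()
        for t in db:
            items = set(t)
            support.update(iset for iset in candidates if items >= set(iset))
        return {iset: c for iset, c in support.items() if c >= min_sup}
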
Scalable Frequent Itemset Mining Methods

• Apriori: A Candidate Generation-and-Test Approach
• Improving the Efficiency of Apriori
• FPGrowth: A Frequent Pattern-Growth Approach

Pattern-Growth Approach: Mining Frequent Patterns Without Candidate Generation

• Bottlenecks of the Apriori approach
  • Breadth-first (i.e., level-wise) search
  • Candidate generation and test
    • Often generates a huge number of candidates
• The FPGrowth approach (J. Han, J. Pei, and Y. Yin, SIGMOD'00)
  • Depth-first search
  • Avoids explicit candidate generation
• Major philosophy: grow long patterns from short ones using local frequent items only (see the sketch below)
  • "abc" is a frequent pattern
  • Get all transactions having "abc", i.e., project the DB on abc: DB|abc
  • If "d" is a local frequent item in DB|abc, then abcd is a frequent pattern

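The projection idea can be sketched directly: a minimal recursive pattern-growth routine (names are mine) that projects the database on each locally frequent item, without ever generating and testing candidates.

    from collections import Counter

    def pattern_growth(db, min_sup, prefix=()):
        """Grow long patterns from short ones using local frequent items only."""
        counts = Counter(item for t in db for item in set(t))
        freq = sorted(i for i, c in counts.items() if c >= min_sup)
        result = {}
        for pos, item in enumerate(freq):
            pattern = prefix + (item,)
            result[pattern] = counts[item]
            # DB|pattern: transactions containing `item`, restricted to items
            # later in the fixed order so each pattern is enumerated only once
            later = set(freq[pos + 1:])
            projected = [[i for i in t if i in later] for t in db if item in t]
            result.update(pattern_growth(projected, min_sup, pattern))
        return result

    # With min_sup = 3, abc is frequent here; if 'd' were locally frequent
    # in DB|abc, the recursion would extend abc to abcd
    print(pattern_growth([list("abcd"), list("abcd"), list("abce")], 3))
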
Construct FP-tree from a Transaction Database

min_support = 3

TID   Items bought               (Ordered) frequent items
100   {f, a, c, d, g, i, m, p}   {f, c, a, m, p}
200   {a, b, c, f, l, m, o}      {f, c, a, b, m}
300   {b, f, h, j, o, w}         {f, b}
400   {b, c, k, s, p}            {c, b, p}
500   {a, f, c, e, l, p, m, n}   {f, c, a, m, p}

1. Scan the DB once and find the frequent 1-itemsets (single-item patterns)
2. Sort the frequent items in frequency-descending order to obtain the f-list
3. Scan the DB again and construct the FP-tree

F-list = f-c-a-b-m-p

Header table (item : frequency, with a node-link per item): f:4, c:4, a:3, b:3, m:3, p:3

Resulting FP-tree (each node is item:count):

{}
├── f:4
│   ├── c:3
│   │   └── a:3
│   │       ├── m:2
│   │       │   └── p:2
│   │       └── b:1
│   │           └── m:1
│   └── b:1
└── c:1
    └── b:1
        └── p:1

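A sketch of the two-scan construction in Python (the class and function names are mine):

    from collections import Counter

    class FPNode:
        def __init__(self, item, parent=None):
            self.item, self.parent = item, parent
            self.count = 0
            self.children = {}        # item -> FPNode
            self.link = None          # next node carrying the same item (node-link)

    def build_fptree(db, min_sup):
        # Scan 1: find the frequent items and sort them into the f-list
        support = Counter(i for t in db for i in set(t))
        flist = [i for i, c in sorted(support.items(), key=lambda kv: -kv[1])
                 if c >= min_sup]
        rank = {i: r for r, i in enumerate(flist)}
        # Scan 2: insert each transaction, ordered by the f-list, sharing prefixes
        root, header = FPNode(None), {}
        for t in db:
            node = root
            for item in sorted((i for i in set(t) if i in rank), key=rank.get):
                child = node.children.get(item)
                if child is None:
                    child = FPNode(item, node)
                    node.children[item] = child
                    child.link, header[item] = header.get(item), child
                child.count += 1
                node = child
        return root, header, flist

    db = [list("facdgimp"), list("abcflmo"), list("bfhjow"),
          list("bcksp"), list("afcelpmn")]
    root, header, flist = build_fptree(db, 3)
    print(flist)  # ['f', 'c', 'a', 'm', 'p', 'b']; ties at equal support may be
                  # broken in any fixed order (the slide uses f-c-a-b-m-p)
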
Partition Patterns and Databases

• Frequent patterns can be partitioned into subsets according to the f-list
  • F-list = f-c-a-b-m-p
  • Patterns containing p
  • Patterns having m but no p
  • ...
  • Patterns having c but none of a, b, m, p
  • Pattern f
• This partitioning is complete and non-redundant

Find Patterns Having p From p's Conditional Database

• Start at the frequent-item header table of the FP-tree
• Traverse the FP-tree by following the node-links of each frequent item p
• Accumulate all the transformed prefix paths of item p to form p's conditional pattern base

Using the FP-tree built above (F-list = f-c-a-b-m-p), the conditional pattern bases are:

Item   Conditional pattern base
c      f:3
a      fc:3
b      fca:1, f:1, c:1
m      fca:2, fcab:1
p      fcam:2, cb:1

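This table can be reproduced with a short sketch. Since every ordered transaction is inserted as one root-to-leaf path, collecting each item's prefix paths directly from the ordered transactions gives the same conditional pattern bases as climbing the tree from the node-links (the function name is mine):

    from collections import Counter

    flist = ["f", "c", "a", "b", "m", "p"]      # frequency-descending order
    ordered_db = [                               # the (ordered) frequent items
        ["f", "c", "a", "m", "p"],               # from the construction slide
        ["f", "c", "a", "b", "m"],
        ["f", "b"],
        ["c", "b", "p"],
        ["f", "c", "a", "m", "p"],
    ]

    def conditional_pattern_base(db, item):
        """Prefix paths preceding `item`, with their counts."""
        base = Counter()
        for trans in db:
            if item in trans:
                prefix = tuple(trans[: trans.index(item)])
                if prefix:
                    base[prefix] += 1
        return base

    for it in reversed(flist):
        print(it, dict(conditional_pattern_base(ordered_db, it)))
    # p {('f','c','a','m'): 2, ('c','b'): 1}
    # m {('f','c','a'): 2, ('f','c','a','b'): 1}
    # ... matching the table above
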
From Conditional Pattern Bases to Conditional FP-trees

• For each pattern base:
  • Accumulate the count for each item in the base
  • Construct the FP-tree for the frequent items of the pattern base

m's conditional pattern base: fca:2, fcab:1

Item counts within the base are f:3, c:3, a:3, b:1, so with min_support = 3 item b is dropped.

m-conditional FP-tree (a single path):

{}
└── f:3
    └── c:3
        └── a:3

All frequent patterns relating to m: m, fm, cm, am, fcm, fam, cam, fcam

Recursion: Mining Each Conditional FP-tree

Starting from the m-conditional FP-tree ({} - f:3 - c:3 - a:3):

• Conditional pattern base of "am": (fc:3)
  am-conditional FP-tree: {} - f:3 - c:3
• Conditional pattern base of "cm": (f:3)
  cm-conditional FP-tree: {} - f:3
• Conditional pattern base of "cam": (f:3)
  cam-conditional FP-tree: {} - f:3

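One level of this recursion can be sketched as follows (function names are mine): count the items within m's pattern base, drop those below min_support, and project the base again to obtain the "am" base.

    from collections import Counter

    min_sup = 3

    def local_frequent(pattern_base, min_sup):
        """Count each item across the weighted prefix paths, keep the frequent ones."""
        counts = Counter()
        for path, cnt in pattern_base.items():
            for item in path:
                counts[item] += cnt
        return {i: c for i, c in counts.items() if c >= min_sup}

    def project(pattern_base, item):
        """Conditional pattern base of `item` within an existing pattern base."""
        base = Counter()
        for path, cnt in pattern_base.items():
            if item in path:
                prefix = tuple(path[: path.index(item)])
                if prefix:
                    base[prefix] += cnt
        return base

    m_base = {("f", "c", "a"): 2, ("f", "c", "a", "b"): 1}
    print(local_frequent(m_base, min_sup))   # {'f': 3, 'c': 3, 'a': 3} -- b:1 dropped
    print(dict(project(m_base, "a")))        # {('f', 'c'): 3} -> the "am" pattern base
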
(1) First scan of the transaction database

Min_Support = 3, Min_Confidence = 80%

TID    Itemset
T100   {M, O, N, K, E, Y}
T200   {D, O, N, K, E, Y}
T300   {M, A, K, E}
T400   {M, U, C, K, Y}
T500   {C, O, O, K, I, E}

Find the support count of each item:

Item   Support count
M      3
O      3
N      2
K      5
E      4
Y      3
D      1
A      1
U      1
C      2
I      1

Find the set L of frequent item patterns, containing only the items that achieve minimum support, then sort L in descending order of support count to obtain the F-list:

L:      M:3, O:3, K:5, E:4, Y:3
F-list: K:5, E:4, M:3, O:3, Y:3

(2) Construct the FP-tree

Order the items in each transaction by their priority in list L (the F-list), then build the FP-tree by inserting the ordered itemsets one transaction at a time:

TID    Itemset               Ordered itemset
T100   {M, O, N, K, E, Y}    {K, E, M, O, Y}
T200   {D, O, N, K, E, Y}    {K, E, O, Y}
T300   {M, A, K, E}          {K, E, M}
T400   {M, U, C, K, Y}       {K, M, Y}
T500   {C, O, O, K, I, E}    {K, E, O}

Header table (each entry keeps a node-link into the tree):

Item   Support count
K      5
E      4
M      3
O      3
Y      3

After inserting T100 {K, E, M, O, Y}:

Null
└── K:1
    └── E:1
        └── M:1
            └── O:1
                └── Y:1

After inserting T200 {K, E, O, Y} (the new branch splits below E):

Null
└── K:2
    └── E:2
        ├── M:1
        │   └── O:1
        │       └── Y:1
        └── O:1
            └── Y:1

After inserting T300 {K, E, M} (counts along the shared prefix increase):

Null
└── K:3
    └── E:3
        ├── M:2
        │   └── O:1
        │       └── Y:1
        └── O:1
            └── Y:1

After inserting T400 {K, M, Y} (a new branch directly below K):

Null
└── K:4
    ├── E:3
    │   ├── M:2
    │   │   └── O:1
    │   │       └── Y:1
    │   └── O:1
    │       └── Y:1
    └── M:1
        └── Y:1

After inserting T500 {K, E, O}, the FP-tree is complete:

Null
└── K:5
    ├── E:4
    │   ├── M:2
    │   │   └── O:1
    │   │       └── Y:1
    │   └── O:2
    │       └── Y:1
    └── M:1
        └── Y:1

(3) For each item (starting from the bottom of the F-list), build its conditional pattern base and then its conditional FP-tree:

Item   Conditional pattern base            Conditional FP-tree
Y      {KEMO : 1}, {KEO : 1}, {KM : 1}     {K : 3}
O      {KEM : 1}, {KE : 2}                 {KE : 3}
M      {KE : 2}, {K : 1}                   {K : 3}
E      {K : 4}                             {K : 4}
K      -                                   -

(4) Generate the frequent patterns from the conditional FP-trees:

Item   Generated frequent patterns
Y      {K, Y : 3}
O      {K, O : 3}, {E, O : 3}, {K, E, O : 3}
M      {K, M : 3}
E      {K, E : 4}
K      -

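As a sanity check of the worked example, a library implementation can be run over the same transactions; a minimal sketch assuming the mlxtend and pandas packages are installed (min_support is given as a fraction, 3 of 5 transactions = 0.6):

    import pandas as pd
    from mlxtend.preprocessing import TransactionEncoder
    from mlxtend.frequent_patterns import fpgrowth

    transactions = [
        ["M", "O", "N", "K", "E", "Y"],
        ["D", "O", "N", "K", "E", "Y"],
        ["M", "A", "K", "E"],
        ["M", "U", "C", "K", "Y"],
        ["C", "O", "O", "K", "I", "E"],
    ]

    # One-hot encode the transactions, then mine with FP-Growth
    te = TransactionEncoder()
    onehot = te.fit(transactions).transform(transactions)
    df = pd.DataFrame(onehot, columns=te.columns_)
    print(fpgrowth(df, min_support=0.6, use_colnames=True))
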
Benefits of the FP-tree Structure

• Completeness
  • Preserves complete information for frequent pattern mining
  • Never breaks a long pattern of any transaction
• Compactness
  • Reduces irrelevant information: infrequent items are gone
  • Items are in frequency-descending order: the more frequently an item occurs, the more likely its nodes are shared
  • The tree is never larger than the original database (not counting node-links and count fields)

The Frequent Pattern Growth Mining Method

• Idea: frequent pattern growth
  • Recursively grow frequent patterns by pattern and database partitioning
• Method (a complete sketch follows below)
  • For each frequent item, construct its conditional pattern base, and then its conditional FP-tree
  • Repeat the process on each newly created conditional FP-tree
  • Stop when the resulting FP-tree is empty or contains only a single path; a single path generates all the combinations of its sub-paths, each of which is a frequent pattern

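Putting the pieces together, here is a compact, self-contained sketch of the whole method in Python (all names are mine; a didactic implementation under the assumptions above, not an optimized one). It rebuilds a conditional FP-tree for each conditional pattern base, mirroring the recursion just described, and reproduces the worked example.

    from collections import Counter, defaultdict

    class FPNode:
        def __init__(self, item, parent):
            self.item, self.parent, self.count = item, parent, 0
            self.children = {}

    def build_fptree(transactions, min_sup):
        """Pass 1: count supports. Pass 2: insert each transaction, ordered by
        descending support, sharing prefixes. Returns the node-links per item."""
        support = Counter(i for t in transactions for i in set(t))
        root, links = FPNode(None, None), defaultdict(list)
        for t in transactions:
            items = sorted((i for i in set(t) if support[i] >= min_sup),
                           key=lambda i: (-support[i], i))
            node = root
            for item in items:
                if item not in node.children:
                    child = FPNode(item, node)
                    node.children[item] = child
                    links[item].append(child)
                node = node.children[item]
                node.count += 1
        return links

    def fpgrowth(transactions, min_sup, suffix=()):
        """Mine all frequent itemsets by recursively building conditional
        FP-trees from conditional pattern bases."""
        patterns = {}
        for item, nodes in build_fptree(transactions, min_sup).items():
            count = sum(n.count for n in nodes)
            patterns[frozenset((item,) + suffix)] = count
            # Conditional pattern base: the prefix path of every node of `item`,
            # repeated count times (simple but memory-hungry; fine for a sketch)
            cond_db = []
            for n in nodes:
                path, p = [], n.parent
                while p.item is not None:
                    path.append(p.item)
                    p = p.parent
                cond_db.extend([path] * n.count)
            patterns.update(fpgrowth(cond_db, min_sup, (item,) + suffix))
        return patterns

    # Reproduces the example: {K,Y}:3, {K,M}:3, {K,E}:4, {K,O}:3, {E,O}:3, {K,E,O}:3, ...
    db = [list("MONKEY"), list("DONKEY"), list("MAKE"), list("MUCKY"), list("COOKIE")]
    for itemset, cnt in sorted(fpgrowth(db, 3).items(),
                               key=lambda kv: (len(kv[0]), sorted(kv[0]))):
        print(set(itemset), ":", cnt)
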
Performance of FPGrowth in Large Datasets

[Two runtime plots, omitted here: runtime (sec.) vs. support threshold (%). Left, data set T25I20D10K: FP-Growth vs. Apriori. Right, data set T25I20D100K: FP-Growth vs. TreeProjection.]

Advantages of the Pattern Growth Approach

• Divide-and-conquer
  • Decompose both the mining task and the DB according to the frequent patterns obtained so far
  • Leads to focused searches of smaller databases
• Other factors
  • No candidate generation, no candidate test
  • Compressed database: the FP-tree structure
  • No repeated scans of the entire database
  • Basic operations are counting local frequent items and building sub-FP-trees; no pattern search and matching
• A good open-source implementation and refinement of FPGrowth
  • FPGrowth+ (Grahne and J. Zhu, FIMI'03)
