
CSE 385 - Data Mining and Business Intelligence - Lecture 03 - Part 01

The document discusses different methods for mining frequent itemsets from transactional databases, including the Apriori algorithm which uses candidate generation and support counting, and the FP-Growth approach which avoids candidate generation by building an FP-tree structure and mining patterns by pattern fragment growth. It also presents methods for improving the efficiency of frequent itemset mining, such as sampling databases to reduce scans, partitioning patterns and databases to mine subsets in parallel, and building conditional FP-trees to recursively mine conditional patterns.


DATA MINING AND BUSINESS INTELLIGENCE - LECTURE 03
Dr. Mahmoud Mounir
[email protected]
Scalable Frequent Itemset Mining Methods

◼ Apriori: A Candidate Generation-and-Test Approach

◼ Improving the Efficiency of Apriori

◼ FPGrowth: A Frequent Pattern-Growth Approach

2
Further Improvement of the Apriori Method

◼ Major computational challenges


◼ Multiple scans of transaction database
◼ Huge number of candidates
◼ Tedious workload of support counting for candidates
◼ Improving Apriori: general ideas
◼ Reduce passes of transaction database scans
◼ Shrink number of candidates
◼ Facilitate support counting of candidates

3
Sampling for Frequent Patterns

◼ Select a sample of the original database; mine frequent patterns within the sample using Apriori
◼ Scan the database once to verify the frequent itemsets found in the sample; only the borders of the closure of the frequent patterns need to be checked
◼ Example: check abcd instead of ab, ac, …, etc.
◼ Scan the database again to find missed frequent patterns
◼ H. Toivonen. Sampling large databases for association rules. In VLDB’96

4
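In Python, the sampling idea can be sketched as follows. This is an illustrative simplification, not Toivonen's exact border-checking algorithm: mine a random sample at a lowered support threshold (here 2 instead of 3, a value chosen for the demo), then make one scan of the full database to verify which candidates are genuinely frequent. The `frequent_itemsets` helper is a naive level-wise miner written here only for illustration.

```python
import random

def frequent_itemsets(db, min_count):
    """Naive level-wise miner: returns {itemset: support count}."""
    items = {i for t in db for i in t}
    freq, level = {}, {frozenset([i]) for i in items}
    while level:
        counts = {c: sum(1 for t in db if c <= t) for c in level}
        current = {c: n for c, n in counts.items() if n >= min_count}
        freq.update(current)
        keys = list(current)
        level = {a | b for a in keys for b in keys if len(a | b) == len(a) + 1}
    return freq

# the lecture's five transactions
db = [frozenset(t) for t in ["facdgimp", "abcflmo", "bfhjow", "bcksp",
                             "afcelpmn"]]

random.seed(0)
sample = random.sample(db, 3)              # mine only a sample of the DB
candidates = frequent_itemsets(sample, 2)  # lowered threshold on the sample
# a single scan of the full DB verifies which candidates truly qualify
verified = {c for c in candidates if sum(1 for t in db if c <= t) >= 3}
```

The key saving is that the expensive level-wise mining runs only on the sample; the full database is touched once (plus one more scan in Toivonen's method to catch patterns the sample missed).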
Scalable Frequent Itemset Mining Methods

◼ Apriori: A Candidate Generation-and-Test Approach

◼ Improving the Efficiency of Apriori

◼ FPGrowth: A Frequent Pattern-Growth Approach

5
Pattern-Growth Approach: Mining Frequent
Patterns Without Candidate Generation
◼ Bottlenecks of the Apriori approach
◼ Breadth-first (i.e., level-wise) search
◼ Candidate generation and test
◼ Often generates a huge number of candidates
◼ The FPGrowth Approach (J. Han, J. Pei, and Y. Yin, SIGMOD’00)
◼ Depth-first search
◼ Avoid explicit candidate generation
◼ Major philosophy: Grow long patterns from short ones using local
frequent items only
◼ “abc” is a frequent pattern
◼ Get all transactions having “abc”, i.e., project DB on abc: DB|abc
◼ “d” is a local frequent item in DB|abc → abcd is a frequent pattern
6
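The projection step DB|abc can be sketched directly. A minimal Python illustration on a toy database (the names `projected` and `extensions` are chosen here for the demo, not taken from the lecture):

```python
from collections import Counter

db = [{"a","b","c","d"}, {"a","b","c","d"}, {"a","b","c","d","e"}, {"b","c"}]
min_sup, pattern = 3, {"a", "b", "c"}      # "abc" is a frequent pattern

# DB|abc: the transactions containing abc, with abc itself removed
projected = [t - pattern for t in db if pattern <= t]
local = Counter(i for t in projected for i in t)
extensions = {i for i, n in local.items() if n >= min_sup}
# "d" is a local frequent item in DB|abc, so "abcd" is a frequent pattern
```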
Construct FP-tree from a Transaction Database

min_support = 3

TID | Items bought | (ordered) frequent items
100 | {f, a, c, d, g, i, m, p} | {f, c, a, m, p}
200 | {a, b, c, f, l, m, o} | {f, c, a, b, m}
300 | {b, f, h, j, o, w} | {f, b}
400 | {b, c, k, s, p} | {c, b, p}
500 | {a, f, c, e, l, p, m, n} | {f, c, a, m, p}

1. Scan DB once, find frequent 1-itemsets (single-item patterns)
2. Sort frequent items in frequency-descending order, the f-list
3. Scan DB again, construct FP-tree

Header table (item, frequency, head of node-links): f 4, c 4, a 3, b 3, m 3, p 3
F-list = f-c-a-b-m-p

FP-tree (root {}; each branch is a shared prefix path):
{} → f:4 → c:3 → a:3 → m:2 → p:2
{} → f:4 → c:3 → a:3 → b:1 → m:1
{} → f:4 → b:1
{} → c:1 → b:1 → p:1

7
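The three construction steps can be sketched in Python. This is a minimal illustration, not the lecture's code; note that ties between equally frequent items are broken here by first appearance, so the exact f-list and tree shape may differ slightly from the slides (the mined patterns are the same under any fixed tie order).

```python
from collections import Counter

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 1, {}

def build_fp_tree(db, min_sup):
    # scan 1: count items and fix the f-list (frequency-descending;
    # ties broken by first appearance)
    counts = Counter(i for t in db for i in t)
    flist = [i for i, n in counts.most_common() if n >= min_sup]
    rank = {i: r for r, i in enumerate(flist)}
    root, header = FPNode(None, None), {i: [] for i in flist}
    for t in db:  # scan 2: insert each ordered frequent-item list
        node = root
        for item in sorted((i for i in t if i in rank), key=rank.get):
            if item in node.children:
                node.children[item].count += 1
            else:
                node.children[item] = FPNode(item, node)
                header[item].append(node.children[item])  # node-link
            node = node.children[item]
    return root, header, flist

db = [list("facdgimp"), list("abcflmo"), list("bfhjow"), list("bcksp"),
      list("afcelpmn")]
root, header, flist = build_fp_tree(db, 3)
```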
Partition Patterns and Databases

◼ Frequent patterns can be partitioned into subsets


according to f-list
◼ F-list = f-c-a-b-m-p

◼ Patterns containing p

◼ Patterns having m but no p

◼ …

◼ Patterns having c but no a nor b, m, p

◼ Pattern f

◼ Completeness and non-redundancy

8
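The partitioning can be made concrete: every frequent pattern belongs to exactly one subset, keyed by its lowest-priority item in f-list order. A minimal sketch (the `bucket` helper is a name chosen here for illustration):

```python
flist = ["f", "c", "a", "b", "m", "p"]
rank = {item: r for r, item in enumerate(flist)}

def bucket(pattern):
    """Subset key: the pattern's last item in f-list order."""
    return max(pattern, key=rank.get)

# "patterns containing p" vs. "patterns having b but no m, p", etc.
print(bucket({"f", "c", "a", "m", "p"}))  # -> p
print(bucket({"c", "b"}))                 # -> b
print(bucket({"f"}))                      # -> f
```

Since each pattern has exactly one such last item, the subsets are complete and non-redundant.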
Find Patterns Having p From p-conditional Database

◼ Starting at the frequent-item header table in the FP-tree
◼ Traverse the FP-tree by following the node-links of each frequent item p
◼ Accumulate all the transformed prefix paths of item p to form p’s conditional pattern base

Conditional pattern bases (read off the FP-tree; header table: f 4, c 4, a 3, b 3, m 3, p 3):

item | conditional pattern base
c | f:3
a | fc:3
b | fca:1, f:1, c:1
m | fca:2, fcab:1
p | fcam:2, cb:1

9
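The conditional pattern bases in the table can be derived from the ordered transactions: for each occurrence of an item, collect the prefix path in front of it. A minimal sketch:

```python
from collections import Counter

# transactions with frequent items ordered by the f-list f-c-a-b-m-p
ordered = [("f","c","a","m","p"), ("f","c","a","b","m"), ("f","b"),
           ("c","b","p"), ("f","c","a","m","p")]

def conditional_pattern_base(item):
    """Collect `item`'s prefix paths, one count per occurrence."""
    base = Counter()
    for t in ordered:
        if item in t:
            prefix = t[:t.index(item)]
            if prefix:
                base[prefix] += 1
    return base

# matches the table: p -> {fcam: 2, cb: 1}, m -> {fca: 2, fcab: 1}
```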
From Conditional Pattern-bases to Conditional FP-trees

◼ For each pattern-base
◼ Accumulate the count for each item in the base
◼ Construct the FP-tree for the frequent items of the pattern base

m-conditional pattern base: fca:2, fcab:1
In this base, f, c, and a each appear 3 times, but b appears only once, so b is dropped (min_support = 3).

m-conditional FP-tree: {} → f:3 → c:3 → a:3

All frequent patterns relating to m: m, fm, cm, am, fcm, fam, cam, fcam

10
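The two steps, applied to m's conditional pattern base, can be sketched as follows (min_support = 3, as in the example):

```python
from collections import Counter

min_sup = 3
m_base = {("f","c","a"): 2, ("f","c","a","b"): 1}  # m's conditional pattern base

# step 1: accumulate the count of each item in the base
counts = Counter()
for path, n in m_base.items():
    for item in path:
        counts[item] += n            # f:3, c:3, a:3, b:1

# step 2: keep only locally frequent items; the filtered paths collapse
# into the m-conditional FP-tree (here a single path f:3 -> c:3 -> a:3)
local = {i for i, n in counts.items() if n >= min_sup}
tree = Counter()
for path, n in m_base.items():
    tree[tuple(i for i in path if i in local)] += n
```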
Recursion: Mining Each Conditional FP-tree

m-conditional FP-tree: {} → f:3 → c:3 → a:3

◼ Cond. pattern base of “am”: (fc:3) → am-conditional FP-tree: {} → f:3 → c:3
◼ Cond. pattern base of “cm”: (f:3) → cm-conditional FP-tree: {} → f:3
◼ Cond. pattern base of “cam”: (f:3) → cam-conditional FP-tree: {} → f:3

11
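The full recursion fits in one small function if the FP-tree is simplified to a dictionary of ordered paths with counts. This is an illustrative sketch of the idea, not the node-link implementation from the paper:

```python
from collections import Counter

def fp_growth(paths, min_sup, suffix=()):
    """Recursively mine {ordered path: count} pattern bases.
    Returns {frozenset(pattern): support}."""
    counts = Counter()
    for path, n in paths.items():
        for item in path:
            counts[item] += n
    patterns = {}
    for item, n in counts.items():
        if n < min_sup:
            continue
        patterns[frozenset((item,) + suffix)] = n
        # build item's conditional pattern base and recurse on it
        base = Counter()
        for path, c in paths.items():
            if item in path:
                prefix = path[:path.index(item)]
                if prefix:
                    base[prefix] += c
        patterns.update(fp_growth(base, min_sup, (item,) + suffix))
    return patterns

# the lecture's database as ordered paths with counts
db = Counter({("f","c","a","m","p"): 2, ("f","c","a","b","m"): 1,
              ("f","b"): 1, ("c","b","p"): 1})
result = fp_growth(db, 3)
```

Run on the lecture's database, this returns all 18 frequent patterns, including the eight that relate to m.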
(1) Fast scan the transaction database

Tid | Itemset
T100 | {M, O, N, K, E, Y}
T200 | {D, O, N, K, E, Y}
T300 | {M, A, K, E}
T400 | {M, U, C, K, Y}
T500 | {C, O, O, K, I, E}

Min_Support = 3
Min_Confidence = 80%

Find the support count of each item:
M 3, O 3, N 2, K 5, E 4, Y 3, D 1, A 1, U 1, C 2, I 1

Find the set (L) of frequent item patterns, which contains only the items that achieve minimum support:
M 3, O 3, K 5, E 4, Y 3

Sort list L in descending order of support count:
K 5, E 4, M 3, O 3, Y 3

Using the Apriori Algorithm

Candidate 2-itemsets and their support counts:
{M, O} 1, {M, K} 3, {M, E} 2, {M, Y} 2, {O, K} 3, {O, E} 3, {O, Y} 2, {K, E} 4, {K, Y} 3, {E, Y} 2

Frequent 2-itemsets:
{M, K} 3, {O, K} 3, {O, E} 3, {K, E} 4, {K, Y} 3

Candidate 3-itemsets (and larger) with their support counts:
{M, K, O} 1, {M, K, E} 2, {M, K, Y} 2, {O, K, E} 3, {K, E, Y} 2, {M, K, E, O} 1, {O, K, E, Y} 2

Frequent 3-itemset: {O, K, E} 3
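The level-wise walkthrough above can be reproduced with a compact Apriori sketch (generic code with join and prune steps, written here for illustration, not taken from the lecture):

```python
from itertools import combinations

def apriori(db, min_count):
    db = [frozenset(t) for t in db]
    Lk = {frozenset([i]) for i in {i for t in db for i in t}
          if sum(1 for t in db if i in t) >= min_count}
    freq = {s: sum(1 for t in db if s <= t) for s in Lk}
    k = 2
    while Lk:
        # join: merge frequent (k-1)-itemsets into k-item candidates
        Ck = {a | b for a in Lk for b in Lk if len(a | b) == k}
        # prune: a candidate survives only if all its (k-1)-subsets are frequent
        Ck = {c for c in Ck
              if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
        Lk = {c for c in Ck if sum(1 for t in db if c <= t) >= min_count}
        freq.update({s: sum(1 for t in db if s <= t) for s in Lk})
        k += 1
    return freq

db = [set("MONKEY"), set("DONKEY"), set("MAKE"), set("MUCKY"), set("COOKIE")]
freq = apriori(db, 3)
# L1 = {K:5, E:4, M:3, O:3, Y:3}; L2 = {MK, OK, OE, KY}:3 and {KE}:4;
# L3 = {OKE}:3, matching the tables above
```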
(2) Construct the FP tree

Order the itemset in each transaction based on item priority in list L, then build the FP tree:

Tid | Itemset | Ordered itemset
T100 | {M, O, N, K, E, Y} | {K, E, M, O, Y}
T200 | {D, O, N, K, E, Y} | {K, E, O, Y}
T300 | {M, A, K, E} | {K, E, M}
T400 | {M, U, C, K, Y} | {K, M, Y}
T500 | {C, O, O, K, I, E} | {K, E, O}

Header table (item, support count, node link): K 5, E 4, M 3, O 3, Y 3
(2) Construct the FP tree

After inserting T100 (ordered {K, E, M, O, Y}):

Null → K:1 → E:1 → M:1 → O:1 → Y:1
(2) Construct the FP tree

After inserting T200 (ordered {K, E, O, Y}; it shares the prefix K, E with the existing path):

Null → K:2 → E:2 → M:1 → O:1 → Y:1
Null → K:2 → E:2 → O:1 → Y:1
(2) Construct the FP tree

After inserting T300 (ordered {K, E, M}):

Null → K:3 → E:3 → M:2 → O:1 → Y:1
Null → K:3 → E:3 → O:1 → Y:1
(2) Construct the FP tree

After inserting T400 (ordered {K, M, Y}; it starts a new branch under K):

Null → K:4 → E:3 → M:2 → O:1 → Y:1
Null → K:4 → E:3 → O:1 → Y:1
Null → K:4 → M:1 → Y:1
(2) Construct the FP tree

After inserting T500 (ordered {K, E, O}), the final FP-tree:

Null → K:5 → E:4 → M:2 → O:1 → Y:1
Null → K:5 → E:4 → O:2 → Y:1
Null → K:5 → M:1 → Y:1

Header table (item, support count, node link): K 5, E 4, M 3, O 3, Y 3
Build the conditional pattern base for each item, then build its conditional FP-tree:

Item | Conditional Pattern Base | Conditional FP-tree
Y | {KEMO : 1}, {KEO : 1}, {KM : 1} | {K : 3}
O | {KEM : 1}, {KE : 2} | {KE : 3}
M | {KE : 2}, {K : 1} | {K : 3}
E | {K : 4} | {K : 4}
K | - | -

Generate frequent patterns from each conditional FP-tree:

Item | Generated Frequent Patterns
Y | <K, Y : 3>
O | <K, O : 3>, <E, O : 3>, <K, E, O : 3>
M | <K, M : 3>
E | <K, E : 4>
K | -
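These patterns can be cross-checked by brute force: enumerate every itemset of size 2 or more and keep those meeting the support threshold (generic verification code, not from the slides):

```python
from itertools import combinations

db = [set("MONKEY"), set("DONKEY"), set("MAKE"), set("MUCKY"), set("COOKIE")]
min_sup = 3

def support(items):
    return sum(1 for t in db if set(items) <= t)

universe = sorted({i for t in db for i in t})
frequent = {frozenset(c): support(c)
            for k in range(2, len(universe) + 1)
            for c in combinations(universe, k)
            if support(c) >= min_sup}
# exactly the six patterns in the table:
# {K,Y}:3, {K,O}:3, {E,O}:3, {K,E,O}:3, {K,M}:3, {K,E}:4
```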
Example

Min_Support = 2
Min_Confidence = 70%

21
(1) Fast scan the transaction database

Tid | Itemset
T100 | I1, I2, I5
T200 | I2, I4
T300 | I2, I3
T400 | I1, I2, I4
T500 | I1, I3
T600 | I2, I3
T700 | I1, I3
T800 | I1, I2, I3, I5
T900 | I1, I2, I3

Min_Support = 2
Min_Confidence = 70%

Find the support count of each item; the set (L) of frequent item patterns contains only the items that achieve minimum support:
I1 6, I2 7, I3 6, I4 2, I5 2

Sort list L in descending order of support count:
I2 7, I1 6, I3 6, I4 2, I5 2
(2) Construct the FP tree

Order the itemset in each transaction based on its priority in list L, then build the FP tree:

Tid | Itemset | Ordered Itemset
T100 | I1, I2, I5 | I2, I1, I5
T200 | I2, I4 | I2, I4
T300 | I2, I3 | I2, I3
T400 | I1, I2, I4 | I2, I1, I4
T500 | I1, I3 | I1, I3
T600 | I2, I3 | I2, I3
T700 | I1, I3 | I1, I3
T800 | I1, I2, I3, I5 | I2, I1, I3, I5
T900 | I1, I2, I3 | I2, I1, I3
(2) Construct the FP tree (continued)

Header table (item, support count, node link): I2 7, I1 6, I3 6, I4 2, I5 2

FP-tree (Null root; each branch is a shared prefix path):
Null → I2:7 → I1:4 → I5:1
Null → I2:7 → I1:4 → I3:2 → I5:1
Null → I2:7 → I1:4 → I4:1
Null → I2:7 → I3:2
Null → I2:7 → I4:1
Null → I1:2 → I3:2
Build the conditional pattern base for each item, then build its conditional FP-tree:

Item | Conditional Pattern Base | Conditional FP-tree
I5 | {I2 I1 : 1}, {I2 I1 I3 : 1} | {I2 I1 : 2}
I4 | {I2 : 1}, {I2 I1 : 1} | {I2 : 2}
I3 | {I2 I1 : 2}, {I2 : 2}, {I1 : 2} | {I2 : 4}, {I1 : 4}, {I2 I1 : 2}
I1 | {I2 : 4} | {I2 : 4}
I2 | - | -
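The slides stop at the conditional FP-trees; for completeness, the frequent patterns each tree generates (min support = 2; the supports below are computed here from the nine transactions, not listed on the slides) can be verified by direct support counting:

```python
db = [{"I1","I2","I5"}, {"I2","I4"}, {"I2","I3"}, {"I1","I2","I4"},
      {"I1","I3"}, {"I2","I3"}, {"I1","I3"}, {"I1","I2","I3","I5"},
      {"I1","I2","I3"}]

def support(items):
    return sum(1 for t in db if items <= t)

# patterns generated from each conditional FP-tree, with their supports
generated = {
    frozenset({"I2","I5"}): 2, frozenset({"I1","I5"}): 2,
    frozenset({"I2","I1","I5"}): 2,                      # from I5's tree
    frozenset({"I2","I4"}): 2,                           # from I4's tree
    frozenset({"I2","I3"}): 4, frozenset({"I1","I3"}): 4,
    frozenset({"I2","I1","I3"}): 2,                      # from I3's tree
    frozenset({"I2","I1"}): 4,                           # from I1's tree
}
ok = all(support(set(p)) == c for p, c in generated.items())
```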
Benefits of the FP-tree Structure

◼ Completeness
◼ Preserve complete information for frequent pattern
mining
◼ Never break a long pattern of any transaction
◼ Compactness
◼ Reduce irrelevant info—infrequent items are gone
◼ Items in frequency descending order: the more
frequently occurring, the more likely to be shared
◼ Never be larger than the original database (not counting node-links and the count field)

28
The Frequent Pattern Growth Mining Method

◼ Idea: Frequent pattern growth
◼ Recursively grow frequent patterns by pattern and database partition
◼ Method
◼ For each frequent item, construct its conditional pattern-base, and then its conditional FP-tree
◼ Repeat the process on each newly created conditional FP-tree
◼ Until the resulting FP-tree is empty, or it contains only one path; a single path generates all the combinations of its sub-paths, each of which is a frequent pattern

29
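The single-path termination case is easy to illustrate: every combination of the path's nodes, joined with the current suffix, is frequent, with support equal to the smallest count involved. A sketch using the m-conditional FP-tree from the earlier example:

```python
from itertools import combinations

path = [("f", 3), ("c", 3), ("a", 3)]   # single-path m-conditional FP-tree
suffix, suffix_count = ("m",), 3

patterns = {}
for k in range(len(path) + 1):
    for combo in combinations(path, k):
        items = frozenset(i for i, _ in combo) | frozenset(suffix)
        counts = [c for _, c in combo] + [suffix_count]
        patterns[items] = min(counts)   # support = min count on the path
# yields m, fm, cm, am, fcm, fam, cam, fcam, all with support 3
```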
Performance of FPGrowth in Large Datasets

[Two runtime charts, omitted: (1) FP-Growth vs. Apriori on data set T25I20D10K (D1), and (2) FP-Growth vs. TreeProjection on data set T25I20D100K (D2), each plotting runtime (sec.) against support threshold (%).]

30
Advantages of the Pattern Growth Approach

◼ Divide-and-conquer:
◼ Decompose both the mining task and DB according to the
frequent patterns obtained so far
◼ Lead to focused search of smaller databases
◼ Other factors
◼ No candidate generation, no candidate test
◼ Compressed database: FP-tree structure
◼ No repeated scan of entire database
◼ Basic ops: counting local freq items and building sub FP-tree, no
pattern search and matching
◼ A good open-source implementation and refinement of FPGrowth
◼ FPGrowth+ (Grahne and J. Zhu, FIMI'03)

31
