FP-Growth: Data Mining, Lecture 5
Course Teacher
Books
• “Introduction to Data Mining” by Tan, Steinbach, Kumar.
Overview
• The FP-tree contains a compressed
representation of the transaction database.
• A trie (prefix-tree) data structure is used
• Each transaction is a path in the tree – paths can
overlap.
Definition of ‘trie’
• A trie (prefix tree) is a tree in which each root-to-node path spells out a sequence; sequences that share a prefix share the corresponding path.
FP-tree Construction
• The FP-tree is a trie (prefix tree)
• Since transactions are sets of items, we need to transform them into ordered sequences so that we can have prefixes
  • Otherwise, there is no common prefix between the sets {A,B} and {B,C,A}
• We need to impose an order on the items
  • Initially, assume a lexicographic order

TID  Items
1    {A,B}
2    {B,C,D}
3    {A,C,D,E}
4    {A,D,E}
5    {A,B,C}
6    {A,B,C,D}
7    {B,C}
8    {A,B,C}
9    {A,B,D}
10   {B,C,E}
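The transformation above can be sketched in a few lines (an illustrative Python snippet, not from the slides; the variable names are assumptions): sorting each transaction under a fixed item order turns sets into ordered sequences, so shared prefixes become visible.

```python
# Impose a lexicographic order on the items of each transaction,
# turning the sets of the table above into ordered sequences.
transactions = [
    {"A", "B"}, {"B", "C", "D"}, {"A", "C", "D", "E"}, {"A", "D", "E"},
    {"A", "B", "C"}, {"A", "B", "C", "D"}, {"B", "C"}, {"A", "B", "C"},
    {"A", "B", "D"}, {"B", "C", "E"},
]
ordered = [tuple(sorted(t)) for t in transactions]

# Transactions that share items now share prefixes:
print(ordered[0])  # ('A', 'B')
print(ordered[4])  # ('A', 'B', 'C')
```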
FP-tree Construction
• Initially the tree is empty: only the null root

null
FP-tree Construction
• Reading transaction TID = 1: {A,B}
• Each node in the tree has a label consisting of the item and the support (the number of transactions that reach that node, i.e., follow that path)

null
└── A:1
    └── B:1
FP-tree Construction
• Reading transaction TID = 2: {B,C,D}

null
├── A:1
│   └── B:1
└── B:1
    └── C:1
        └── D:1

Each transaction is a path in the tree
FP-tree Construction
After reading transactions TID = 1, 2:

null
├── A:1
│   └── B:1
└── B:1
    └── C:1
        └── D:1

Header Table
Item  Pointer
A
B
C
D
E

The Header Table and the pointers assist in computing the itemset support
FP-tree Construction
• Reading transaction TID = 3: {A,C,D,E}
• A's count is incremented and a new path C → D → E is created under A:

null
├── A:2
│   ├── B:1
│   └── C:1
│       └── D:1
│           └── E:1
└── B:1
    └── C:1
        └── D:1
FP-Tree Construction
Each transaction is a path in the tree. After reading the whole transaction database:

null
├── A:7
│   ├── B:5
│   │   ├── C:3
│   │   │   └── D:1
│   │   └── D:1
│   ├── C:1
│   │   └── D:1
│   │       └── E:1
│   └── D:1
│       └── E:1
└── B:3
    └── C:3
        ├── D:1
        └── E:1

Header table
Item  Pointer
A
B
C
D
E

Pointers link all nodes with the same item and are used to assist frequent itemset generation
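The construction just traced can be sketched as follows (an illustrative implementation, not the slides' code; the `Node` class and the dict-based children are assumptions):

```python
# Build the FP-tree of the example by inserting each transaction as a
# path; the count on a node is the support of the path leading to it.
class Node:
    def __init__(self, item):
        self.item = item
        self.count = 0
        self.children = {}  # item -> child Node

def insert(root, transaction):
    node = root
    for item in sorted(transaction):  # ordered sequence -> path
        if item not in node.children:
            node.children[item] = Node(item)
        node = node.children[item]
        node.count += 1  # one more transaction follows this path

root = Node(None)
for t in [{"A","B"}, {"B","C","D"}, {"A","C","D","E"}, {"A","D","E"},
          {"A","B","C"}, {"A","B","C","D"}, {"B","C"}, {"A","B","C"},
          {"A","B","D"}, {"B","C","E"}]:
    insert(root, t)

# The counts match the tree above:
print(root.children["A"].count)                # 7
print(root.children["B"].count)                # 3
print(root.children["A"].children["B"].count)  # 5
```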
FP-tree size
• Every transaction is a path in the FP-tree
• The size of the tree depends on the compressibility of the data
  • Extreme case: all transactions are identical; the FP-tree is a single branch
  • Extreme case: all transactions are different; the tree is as large as the database (actually larger, since we need additional pointers)
Item ordering
• The size of the tree also depends on the ordering of the items.
• Heuristic: order the items according to their frequency, from larger to smaller.
  • We would need an extra pass over the dataset to count the frequencies.
• Example: consider the lattice of all itemsets over the items A–E:

All Itemsets
E  D  C  B  A
DE CE BE AE CD BD AD BC AC AB
CDE BDE ADE BCE ACE ABE BCD ACD ABD ABC
BCDE ACDE ABDE ABCE ABCD
ABCDE
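The heuristic can be sketched as one extra counting pass (an illustrative Python snippet, not from the slides): count each item's frequency, then sort every transaction by descending frequency so that frequent items sit near the root of the tree.

```python
from collections import Counter

transactions = [
    {"A","B"}, {"B","C","D"}, {"A","C","D","E"}, {"A","D","E"},
    {"A","B","C"}, {"A","B","C","D"}, {"B","C"}, {"A","B","C"},
    {"A","B","D"}, {"B","C","E"},
]

# Extra pass: count in how many transactions each item appears.
freq = Counter(item for t in transactions for item in t)
# freq: B:8, A:7, C:7, D:5, E:3

# Sort each transaction by descending frequency (ties broken
# lexicographically), so frequent items become shared prefixes.
ordered = [sorted(t, key=lambda i: (-freq[i], i)) for t in transactions]
print(ordered[0])  # ['B', 'A']
print(ordered[1])  # ['B', 'C', 'D']
```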
Frequent Itemsets
All Itemsets
E  D  C  B  A                              Frequent?
DE CE BE AE CD BD AD BC AC AB              Frequent?
CDE BDE ADE BCE ACE ABE BCD ACD ABD ABC    Frequent?
BCDE ACDE ABDE ABCE ABCD                   Frequent?
ABCDE                                      Frequent?

We can generate all itemsets this way
We expect the FP-tree to contain far fewer itemsets
Header table
Item  Pointer
A
B
C
D
E

Bottom-up traversal of the tree: first the itemsets ending in E, then those ending in D, etc., each time a suffix-based class

We will then see how to compute the support for the possible itemsets
Ending in C: the paths of the FP-tree that contain C
Ending in B: the paths of the FP-tree that contain B
Algorithm
• For each suffix X
  • Phase 1
    • Construct the prefix tree for X as shown before, and compute the support using the header table and the pointers
  • Phase 2
    • If X is frequent, construct the conditional FP-tree for X in the following steps
      1. Recompute support
      2. Prune infrequent items
      3. Prune leaves and recurse
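The two phases can be sketched recursively. The version below is an assumed implementation, not the slides' code: it works on conditional pattern bases (lists of (prefix path, count) pairs) instead of an explicit trie, but it performs the same recompute/prune/recurse steps and, because subproblems are disjoint, enumerates every frequent itemset exactly once.

```python
from collections import Counter

def fp_growth(patterns, minsup, suffix, out):
    """patterns: conditional pattern base, a list of
    (ordered item tuple, count) pairs."""
    # Phase 1: recompute the support of every item in the base.
    support = Counter()
    for items, count in patterns:
        for item in items:
            support[item] += count
    for item in sorted(support):
        if support[item] < minsup:
            continue  # prune infrequent items
        itemset = suffix | {item}
        out[frozenset(itemset)] = support[item]
        # Phase 2: keep only the prefix paths of this item
        # (prune the leaves) and recurse on the subproblem.
        cond = []
        for items, count in patterns:
            if item in items:
                prefix = items[:items.index(item)]
                if prefix:
                    cond.append((prefix, count))
        fp_growth(cond, minsup, itemset, out)

transactions = [
    {"A","B"}, {"B","C","D"}, {"A","C","D","E"}, {"A","D","E"},
    {"A","B","C"}, {"A","B","C","D"}, {"B","C"}, {"A","B","C"},
    {"A","B","D"}, {"B","C","E"},
]
out = {}
fp_growth([(tuple(sorted(t)), 1) for t in transactions],
          minsup=2, suffix=frozenset(), out=out)

print(out[frozenset({"E"})])         # 3
print(out[frozenset({"D", "E"})])    # 2
print(frozenset({"B", "E"}) in out)  # False: support 1 < minsup
```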
Example
Phase 1 – construct the prefix tree for suffix E
Find all prefix paths that contain E:

null
├── A:7
│   ├── C:1
│   │   └── D:1
│   │       └── E:1
│   └── D:1
│       └── E:1
└── B:3
    └── C:3
        └── E:1

Compute the support for E (minsup = 2)
How? Follow the pointers while summing up the counts of the E nodes: 1 + 1 + 1 = 3 ≥ 2, so E is frequent
Example
E is frequent, so we proceed with Phase 2: build the conditional FP-tree for E

Recompute support: each node's count becomes the total count of the E nodes below it:

null
├── A:2
│   ├── C:1
│   │   └── D:1
│   │       └── E:1
│   └── D:1
│       └── E:1
└── B:1
    └── C:1
        └── E:1

Truncate: delete the nodes of E:

null
├── A:2
│   ├── C:1
│   │   └── D:1
│   └── D:1
└── B:1
    └── C:1

Prune infrequent: in the conditional FP-tree some nodes may have support less than minsup; e.g., B (support 1) needs to be pruned. This means that B appears with E fewer than minsup times. When B is removed, its child C:1 is attached to the root:

null
├── A:2
│   ├── C:1
│   │   └── D:1
│   └── D:1
└── C:1
Example
Phase 1 for suffix D in the conditional FP-tree of E (i.e., itemsets ending in DE): find all prefix paths that contain D

null
└── A:2
    ├── C:1
    │   └── D:1
    └── D:1

Support of D: 1 + 1 = 2 ≥ minsup, so {D,E} is frequent

Phase 2: recompute support (A:2, C:1), truncate the D nodes, and prune C (support 1 < minsup). The conditional FP-tree for DE is:

null
└── A:2

so {A,D,E} is frequent (support 2), and the recursion for suffix DE ends
Example
Back in the conditional FP-tree of E, repeat for suffix C (itemsets ending in CE). Phase 1: find all prefix paths that contain C

null
├── A:2
│   └── C:1
└── C:1

Support of C: 1 + 1 = 2 ≥ minsup, so {C,E} is frequent

Phase 2: recompute support (A:1), prune A (support 1 < minsup), and truncate the C nodes; nothing remains, so the recursion for suffix CE ends
Example
Finally, suffix A (itemsets ending in AE) in the conditional FP-tree of E. Phase 1: find all prefix paths that contain A

null
└── A:2

Support of A is 2 ≥ minsup, so {A,E} is frequent; A has no prefix paths, so the recursion ends

Example
• So for E we have the following frequent itemsets:
{E}, {D,E}, {A,D,E}, {C,E}, {A,E}
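As a sanity check (not part of the slides' algorithm), the result for suffix E can be verified by brute-force counting over the 10 transactions:

```python
from itertools import combinations

transactions = [
    {"A","B"}, {"B","C","D"}, {"A","C","D","E"}, {"A","D","E"},
    {"A","B","C"}, {"A","B","C","D"}, {"B","C"}, {"A","B","C"},
    {"A","B","D"}, {"B","C","E"},
]
minsup = 2

def support(itemset):
    # Number of transactions containing every item of the itemset.
    return sum(1 for t in transactions if itemset <= t)

# Enumerate every itemset that contains E and test it directly.
frequent_with_E = set()
for r in range(5):
    for combo in combinations("ABCD", r):
        candidate = set(combo) | {"E"}
        if support(candidate) >= minsup:
            frequent_with_E.add(frozenset(candidate))

expected = {frozenset(s) for s in
            [{"E"}, {"D","E"}, {"A","D","E"}, {"C","E"}, {"A","E"}]}
print(frequent_with_E == expected)  # True
```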
Example
Ending in D: repeat the process on the full FP-tree for suffix D

Phase 1 – construct the prefix tree: find all prefix paths that contain D

null
├── A:7
│   ├── B:5
│   │   ├── C:3
│   │   │   └── D:1
│   │   └── D:1
│   ├── C:1
│   │   └── D:1
│   └── D:1
└── B:3
    └── C:3
        └── D:1

Support of D: 1 + 1 + 1 + 1 + 1 = 5 ≥ minsup, so D is frequent

Phase 2: convert the prefix tree into the conditional FP-tree
Example
Recompute support: each node's count becomes the total count of the D nodes below it:

null
├── A:4
│   ├── B:2
│   │   ├── C:1
│   │   │   └── D:1
│   │   └── D:1
│   ├── C:1
│   │   └── D:1
│   └── D:1
└── B:1
    └── C:1
        └── D:1

Prune nodes: all remaining items are frequent here, so we only delete the D leaves to obtain the conditional FP-tree for D:

null
├── A:4
│   ├── B:2
│   │   └── C:1
│   └── C:1
└── B:1
    └── C:1

And so on….
Observations
• At each recursive step we solve a subproblem
• Construct the prefix tree
• Compute the new support
• Prune nodes
• Subproblems are disjoint so we never consider
the same itemset twice
Observations
• The efficiency of the algorithm depends on the
compaction factor of the dataset