Unit 4.2: Association Rules, FP-Growth
FP-Growth and the FP-Tree
Scalable Frequent Itemset Mining Methods
Apriori vs. FP-Growth
• Bottlenecks of the Apriori approach
• Breadth-first (i.e., level-wise) search
• Candidate generation and test
• Often generates a huge number of candidates
• The FP-Growth Approach (J. Han, J. Pei, and Y. Yin, SIGMOD'00)
• Depth-first search
• Avoid explicit candidate generation
• Major philosophy: Grow long patterns from short ones using local frequent items only
• “abc” is a frequent pattern
• Get all transactions having “abc”, i.e., project DB on abc: DB|abc
• “d” is a local frequent item in DB|abc → abcd is a frequent pattern
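The projection idea above can be sketched in a few lines of Python. This is a toy illustration only; the database, pattern, and min_sup value are made up for the example:

```python
from collections import Counter

# Hypothetical toy database: each transaction is a set of items.
db = [
    {"a", "b", "c", "d"},
    {"a", "b", "c", "e"},
    {"a", "b", "c", "d"},
    {"b", "c", "e"},
]
min_sup = 2

# Project the database on the pattern {a, b, c}: keep only transactions
# that contain the pattern, then drop the pattern's own items.
pattern = {"a", "b", "c"}
projected = [t - pattern for t in db if pattern <= t]

# Count local frequent items in DB|abc; each one extends the pattern.
local_counts = Counter(item for t in projected for item in t)
extensions = {item for item, n in local_counts.items() if n >= min_sup}
print(extensions)  # {'d'}: so {a, b, c, d} is frequent
```

Here “d” survives the local count (support 2) while “e” does not (support 1), so only abcd is grown from abc.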
Frequent Pattern (FP) Growth Method
• Mines frequent itemsets without candidate generation.
• It is a divide-and-conquer strategy.
• It compresses the database representing frequent items into a frequent-pattern tree (FP-tree), which retains the itemset association information.
• It divides the compressed database into a set of conditional databases, each associated with one frequent item or pattern fragment, and then mines each such database separately.
• The FP-Growth method transforms the problem of finding long frequent patterns into searching for shorter ones recursively and then concatenating the suffix.
• It uses the least frequent items as suffixes.
Adv: reduces search cost, has good selectivity, faster than Apriori.
Disadv: when the database is large, it is sometimes unrealistic to construct a main-memory-based FP-tree.
The Frequent Pattern (FP) Growth algorithm has 2 steps:
1. Scan the database to find the frequent items and build the compact FP-tree (two scans in total).
2. Recursively mine the FP-tree by extracting each item's conditional pattern base and mining its conditional FP-tree.
Example transaction database:
Tid Items
T100 I1,I2,I5
T200 I2,I4
T300 I2,I3
T400 I1,I2,I4
T500 I1,I3
T600 I2,I3
T700 I1,I3
T800 I1,I2,I3,I5
T900 I1,I2,I3
Calculate support count (descending order):
I2: 7
I1: 6
I3: 6
I4: 2
I5: 2
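The two steps can be sketched on this example database. This is a minimal illustration (assuming min_sup = 2, consistent with the counts above); a full FP-tree implementation would also maintain a header table and node-links, which are omitted here:

```python
from collections import Counter

transactions = [
    ["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"],
    ["I1", "I2", "I4"], ["I1", "I3"], ["I2", "I3"],
    ["I1", "I3"], ["I1", "I2", "I3", "I5"], ["I1", "I2", "I3"],
]
min_sup = 2

# Step 1: first database scan -- global support counts; frequent items
# are ranked in descending frequency (ties broken alphabetically).
counts = Counter(i for t in transactions for i in t)
freq = sorted((i for i, n in counts.items() if n >= min_sup),
              key=lambda i: (-counts[i], i))
rank = {i: r for r, i in enumerate(freq)}  # I2, I1, I3, I4, I5

# Step 2: second scan -- insert each transaction, with its items
# re-sorted by global rank, into a trie; shared prefixes merge and
# bump the shared nodes' counts.
class Node:
    def __init__(self, item):
        self.item, self.count, self.children = item, 0, {}

root = Node(None)
for t in transactions:
    node = root
    for item in sorted((i for i in t if i in rank), key=rank.get):
        node = node.children.setdefault(item, Node(item))
        node.count += 1

# The root has two children: I2 (shared by 7 transactions) and I1
# (the 2 transactions that lack I2: T500 and T700).
print({c.item: c.count for c in root.children.values()})
```

The prefix sharing is visible immediately: all seven transactions containing I2 funnel through a single child of the root.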
Summary of the problem solution (from the book)
Write it this way in the exam:
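The frequent patterns for the example database can also be checked with a small pattern-growth sketch. This is illustrative only: it uses projected item lists rather than an actual FP-tree, but it grows patterns from least-frequent suffixes using local frequent items only, in the same divide-and-conquer fashion (min_sup = 2 is assumed, consistent with the counts above):

```python
from collections import Counter

transactions = [
    ["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"],
    ["I1", "I2", "I4"], ["I1", "I3"], ["I2", "I3"],
    ["I1", "I3"], ["I1", "I2", "I3", "I5"], ["I1", "I2", "I3"],
]
min_sup = 2
# Global frequency order from the first scan: I2:7, I1:6, I3:6, I4:2, I5:2.
rank = {"I2": 0, "I1": 1, "I3": 2, "I4": 3, "I5": 4}

def mine(db, suffix, out):
    """Grow patterns from `suffix` using local frequent items only."""
    counts = Counter(i for t in db for i in t)
    for item, n in counts.items():
        if n < min_sup:
            continue
        pattern = tuple(sorted({item, *suffix}, key=rank.get))
        out[pattern] = n
        # Conditional database for `item`: transactions containing it,
        # restricted to items that precede it in the global order.
        cond = [[i for i in t if rank[i] < rank[item]]
                for t in db if item in t]
        mine(cond, {item, *suffix}, out)

patterns = {}
mine(transactions, set(), patterns)
print(patterns[("I2", "I1", "I5")])  # 2
```

For suffix I5, for instance, the conditional database is {I2,I1} and {I2,I1,I3}, which yields the patterns {I2,I5}:2, {I1,I5}:2, and {I2,I1,I5}:2.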
Benefits of the FP-tree Structure
• Completeness
• Preserve complete information for frequent pattern mining
• Never break a long pattern of any transaction
• Compactness
• Reduces irrelevant info: infrequent items are gone
• Items in frequency-descending order: the more frequently
occurring, the more likely to be shared
• Never larger than the original database (not counting node-
links and the count field)
Advantages of the Pattern Growth Approach
• Divide-and-conquer:
• Decompose both the mining task and DB according to the frequent
patterns obtained so far
• Lead to focused search of smaller databases
• Other factors
• No candidate generation, no candidate test
• Compressed database: FP-tree structure
• No repeated scan of entire database
• Basic ops: counting local freq items and building sub FP-tree, no pattern
search and matching
• A good open-source implementation and refinement of FPGrowth
• FPGrowth+ (Grahne and J. Zhu, FIMI'03)
Q: What is the most significant advantage of FP-Tree? Why FP-
Tree is complete in relevance to frequent pattern mining?
• Efficiency: the most significant advantage of the FP-tree is that it
requires only two scans of the underlying database (and exactly two)
to construct. This efficiency is even more apparent for databases with
prolific and long patterns, or when mining frequent patterns with a
low support threshold.
• As each transaction in the database is mapped to one path in the
FP-tree, the frequent-itemset information of every transaction is
completely stored in the FP-tree. Moreover, one path in the FP-tree
may represent frequent itemsets of multiple transactions without
ambiguity, since the path representing each transaction must start
from the root of its item-prefix subtree.
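The transaction-to-path mapping in this answer can be checked directly: insert every transaction (items sorted in the global frequency order) into a prefix tree, then walk each transaction again and confirm its full path exists. This is a toy check on the example database, with the tree reduced to nested dicts:

```python
transactions = [
    ["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"],
    ["I1", "I2", "I4"], ["I1", "I3"], ["I2", "I3"],
    ["I1", "I3"], ["I1", "I2", "I3", "I5"], ["I1", "I2", "I3"],
]
# Global frequency order from the first scan.
rank = {"I2": 0, "I1": 1, "I3": 2, "I4": 3, "I5": 4}

# Build the tree: shared prefixes merge into shared path segments.
root = {}
for t in transactions:
    node = root
    for item in sorted(t, key=rank.get):
        node = node.setdefault(item, {})

# Completeness check: every transaction's sorted item list must form a
# root-to-node path, so no transaction's information is lost.
for t in transactions:
    node = root
    for item in sorted(t, key=rank.get):
        assert item in node, "path broken"
        node = node[item]
print("every transaction maps to one path")
```

T100 (I1,I2,I5), for example, is found intact as the path I2 → I1 → I5 under the root, which is exactly why no long pattern of any transaction is ever broken.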