Association
What Is Frequent Pattern Analysis?
Frequent pattern: a pattern (a set of items, subsequences, substructures,
etc.) that occurs frequently in a data set
First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context
of frequent itemsets and association rule mining
Motivation: Finding inherent regularities in data
What products were often purchased together?
Web pages of interest to groups of users
What are the subsequent purchases after buying a PC?
Finding structural patterns from chemical compounds or social media
Applications
Basket data analysis, cross-marketing, sales campaign analysis, Web
log (click stream) analysis, and DNA sequence analysis & motif
identification
Frequent Patterns: Frequent Itemsets
A frequent pattern in general captures an intrinsic and important property of a dataset.
The frequent patterns of a transaction database are the itemsets frequently purchased together, called frequent itemsets.
Itemset: a set of one or more items
k-itemset: X = {x1, …, xk}
(Absolute) support, or support count, of X: the frequency (number of occurrences) of itemset X
(Relative) support, s: the fraction of transactions that contain X (i.e., the probability that a transaction contains X)
An itemset X is frequent if X's support is no less than a minsup threshold
Tid   Items bought
10    Bread, Nuts, Jam
20    Bread, Coffee, Jam
30    Bread, Jam, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Jam, Eggs, Milk

[Figure: Venn diagram of the transactions containing Bread, the transactions containing Nuts, and the transactions containing both.]
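As a quick illustration (a minimal Python sketch, not part of the original slides), the support of an itemset such as {Bread, Jam} can be counted directly over this transaction table:

    # Minimal sketch: counting support over the five transactions above.
    transactions = {
        10: {"Bread", "Nuts", "Jam"},
        20: {"Bread", "Coffee", "Jam"},
        30: {"Bread", "Jam", "Eggs"},
        40: {"Nuts", "Eggs", "Milk"},
        50: {"Nuts", "Coffee", "Jam", "Eggs", "Milk"},
    }

    def support_count(itemset, db):
        """Absolute support: number of transactions containing every item of the itemset."""
        return sum(1 for items in db.values() if itemset <= items)

    X = {"Bread", "Jam"}
    count = support_count(X, transactions)
    print(count, count / len(transactions))   # 3 transactions -> relative support 0.6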
Basic Concepts: Association Rules
Tid   Items bought
10    Bread, Nuts, Jam
20    Bread, Coffee, Jam
30    Bread, Jam, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Jam, Eggs, Milk

For an association rule X ⇒ Y:
support, s: the probability that a transaction contains X ∪ Y
confidence, c: the conditional probability P(Y|X) that a transaction containing X also contains Y
Let minsup = 50%, minconf = 50%
Minimum support count = 50% of 5 transactions = 2.5, i.e., at least 3 transactions
Freq. patterns: Bread:3, Nuts:3, Jam:4, Eggs:3, {Bread, Jam}:3
Find all the rules X ⇒ Y with minimum support and confidence
Association rules formed from the 2-itemset {Bread, Jam}:
Bread ⇒ Jam (support 60%, confidence 100%)
Jam ⇒ Bread (support 60%, confidence 75%)
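A hedged sketch of how these two numbers are computed (the transaction dictionary repeats the table above; function names are illustrative):

    # Sketch: support and confidence of a candidate rule X => Y over the same transactions.
    transactions = {
        10: {"Bread", "Nuts", "Jam"},
        20: {"Bread", "Coffee", "Jam"},
        30: {"Bread", "Jam", "Eggs"},
        40: {"Nuts", "Eggs", "Milk"},
        50: {"Nuts", "Coffee", "Jam", "Eggs", "Milk"},
    }

    def rule_metrics(X, Y, db):
        n = len(db)
        n_x = sum(1 for t in db.values() if X <= t)           # transactions containing X
        n_xy = sum(1 for t in db.values() if (X | Y) <= t)    # transactions containing X and Y
        return n_xy / n, n_xy / n_x                           # (support, confidence)

    print(rule_metrics({"Bread"}, {"Jam"}, transactions))  # (0.6, 1.0)
    print(rule_metrics({"Jam"}, {"Bread"}, transactions))  # (0.6, 0.75)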
Computational Complexity of Frequent Itemset
Mining
How many itemsets are possibly generated in the worst case?
Worst case: M^N, where M = # distinct items and N = max length of a
transaction (every combination of items in the longest transaction could be
frequent)
The number of frequent itemsets to be generated is sensitive to the
minsup threshold
When minsup is low, there can exist an exponential number of frequent
itemsets
The worst-case complexity vs. the expected probability
Ex. Suppose Walmart has 10^4 distinct items
The probability of picking up a specific item is 10^-4
The probability of picking up a particular set of 10 items: ~10^-40
What is the chance that this particular set of 10 products is frequent,
i.e., occurs 10^3 times in 10^9 transactions?
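A back-of-the-envelope check (a sketch under the independence assumption above, not from the slides): the expected number of occurrences of that particular 10-item set is astronomically far below 10^3, so the chance is essentially zero.

    n = 10**9          # number of transactions
    p = 10**-40        # probability a transaction contains the specific 10-item set

    expected = n * p   # expected number of occurrences
    print(expected)    # 1e-31 -- nowhere near the 10^3 occurrences needed

    # Markov's inequality: P(X >= 1000) <= E[X] / 1000
    print(expected / 1000)   # <= 1e-34, i.e., essentially impossible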
Chapter 5: Mining Frequent Patterns, Association
and Correlations: Basic Concepts and Methods
Basic Concepts
Summary
The Downward Closure Property and Scalable
Approaches to Frequent Pattern Mining
The downward closure property of frequent patterns:
Any subset of a frequent itemset must also be frequent
(e.g., if {Bread, Jam} is frequent, so are {Bread} and {Jam})
Scalable mining methods — three major approaches:
Apriori (Agrawal & Srikant @VLDB'94)
Frequent pattern growth (FPgrowth—Han, Pei & Yin @SIGMOD'00)
Vertical data format approach (Charm—Zaki & Hsiao @SDM'02)
The first two approaches are covered in the syllabus
Frequent Itemset Mining Methods
Apriori: a candidate generation-and-test approach
Apriori: A Candidate Generation & Test Approach
Apriori pruning, based on the anti-monotone property:
If an itemset is found to be infrequent, its supersets need not be
generated or tested as candidates!
Method:
1. Initially, scan the DB once to get the frequent 1-itemsets
2. Generate length-(k+1) candidate itemsets from the length-k frequent itemsets
3. Test the candidates against the DB and identify the frequent ones
4. Repeat steps 2 and 3 for the next k
5. Terminate when no frequent or candidate set can be generated
The Apriori Algorithm—An Example
Supmin = 2 (minimum support count)

Database TDB
Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

C1 (candidate 1-itemsets, counts from the 1st scan):
{A}: 2, {B}: 3, {C}: 3, {D}: 1, {E}: 3

L1 (frequent 1-itemsets):
{A}: 2, {B}: 3, {C}: 3, {E}: 3

C2 (candidate 2-itemsets generated from L1):
{A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}

C2 counts from the 2nd scan:
{A, B}: 1, {A, C}: 2, {A, E}: 1, {B, C}: 2, {B, E}: 3, {C, E}: 2

L2 (frequent 2-itemsets):
{A, C}: 2, {B, C}: 2, {B, E}: 3, {C, E}: 2
The Apriori Algorithm (pseudo-code)
L1 = {frequent 1-itemsets};
for (k = 1; Lk ≠ ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in the database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
Implementation of Apriori
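The details of this slide did not survive extraction. As an illustrative stand-in (not the lecture's own code), here is a compact Python sketch of the candidate-generation-and-test loop — self-join, downward-closure pruning, and support counting — run on the TDB example above; names such as apriori and TDB are my own.

    from itertools import combinations

    def apriori(transactions, min_sup):
        """Minimal Apriori sketch: returns {frozenset itemset: support count}."""
        # 1st scan: count 1-itemsets and keep the frequent ones
        counts = {}
        for t in transactions.values():
            for item in t:
                key = frozenset([item])
                counts[key] = counts.get(key, 0) + 1
        Lk = {i: c for i, c in counts.items() if c >= min_sup}
        frequent = dict(Lk)

        k = 1
        while Lk:
            # Generate length-(k+1) candidates by self-joining Lk ...
            prev = sorted(Lk, key=sorted)
            candidates = set()
            for i in range(len(prev)):
                for j in range(i + 1, len(prev)):
                    union = prev[i] | prev[j]
                    if len(union) == k + 1 and all(
                        frozenset(s) in Lk for s in combinations(union, k)
                    ):  # ... and prune candidates with any infrequent k-subset (downward closure)
                        candidates.add(union)

            # Scan the DB and count the surviving candidates
            cand_counts = {c: 0 for c in candidates}
            for t in transactions.values():
                for c in candidates:
                    if c <= t:
                        cand_counts[c] += 1

            Lk = {c: n for c, n in cand_counts.items() if n >= min_sup}
            frequent.update(Lk)
            k += 1
        return frequent

    # The TDB example from the previous slide, with Supmin = 2:
    TDB = {10: {"A", "C", "D"}, 20: {"B", "C", "E"}, 30: {"A", "B", "C", "E"}, 40: {"B", "E"}}
    for itemset, sup in sorted(apriori(TDB, 2).items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
        print(sorted(itemset), sup)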
Closed Patterns and Max-Patterns
(An itemset X is closed if no proper superset of X has the same support count; X is a max-pattern if X is frequent and none of its proper supersets is frequent.)
Exercise. DB = {<a1, …, a100>, <a1, …, a50>}, min_sup = 1
What is the set of closed itemsets?
<a1, …, a100>: 1
<a1, …, a50>: 2
What is the set of max-patterns?
<a1, …, a100>: 1
What is the set of all frequent patterns?
All 2^100 − 1 non-empty subsets of {a1, …, a100} — far too many to list!
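A small illustrative sketch (function names are mine, and the toy support table is hypothetical) of how closed and maximal itemsets can be filtered from a set of frequent itemsets with known support counts:

    def closed_and_max(frequent):
        """frequent: {frozenset: support count}. Returns (closed, maximal) dicts."""
        closed, maximal = {}, {}
        for X, sup in frequent.items():
            supersets = [Y for Y in frequent if X < Y]
            if not any(frequent[Y] == sup for Y in supersets):
                closed[X] = sup                # no proper superset with the same support
            if not supersets:
                maximal[X] = sup               # no frequent proper superset at all
        return closed, maximal

    # Tiny hypothetical illustration (min_sup already applied; consistent with DB = {ab, abc}):
    freq = {
        frozenset("a"): 2, frozenset("b"): 2, frozenset("ab"): 2, frozenset("abc"): 1,
        frozenset("c"): 1, frozenset("ac"): 1, frozenset("bc"): 1,
    }
    closed, maximal = closed_and_max(freq)
    print({tuple(sorted(k)): v for k, v in closed.items()})    # {('a','b'): 2, ('a','b','c'): 1}
    print({tuple(sorted(k)): v for k, v in maximal.items()})   # {('a','b','c'): 1}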
Representing frequent patterns: Example
Association Rule Formation from a
Frequent Pattern
Approach: for each frequent itemset l and each non-empty proper subset s of l, output the rule s ⇒ (l − s) if its confidence, support(l) / support(s), is at least minconf; every such rule automatically satisfies minsup, since its support equals support(l). See the sketch below.
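A minimal sketch of this approach (illustrative names; the support counts below are the ones from the {Bread, Jam} example earlier):

    from itertools import combinations

    def rules_from_itemset(itemset, support, min_conf):
        """support: {frozenset: support count}; must contain all subsets of `itemset`."""
        itemset = frozenset(itemset)
        rules = []
        for r in range(1, len(itemset)):
            for lhs in combinations(itemset, r):
                lhs = frozenset(lhs)
                conf = support[itemset] / support[lhs]
                if conf >= min_conf:
                    rules.append((set(lhs), set(itemset - lhs), conf))
        return rules

    # The 2-itemset {Bread, Jam} from the earlier slide (support counts out of 5 transactions):
    support = {frozenset({"Bread"}): 3, frozenset({"Jam"}): 4, frozenset({"Bread", "Jam"}): 3}
    for lhs, rhs, conf in rules_from_itemset({"Bread", "Jam"}, support, 0.5):
        print(lhs, "=>", rhs, f"confidence={conf:.2f}")
    # {'Bread'} => {'Jam'} confidence=1.00
    # {'Jam'} => {'Bread'} confidence=0.75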
Further Improvement of the Apriori Method
Approach: reduce the number of database scans, shrink the number of candidates, and facilitate the support counting of candidates
Pattern-Growth Approach: Mining Frequent
Patterns Without Candidate Generation
Bottlenecks of the Apriori approach:
Breadth-first (i.e., level-wise) search
Candidate generation and test — often generates a huge number of candidates
The FPGrowth approach (J. Han, J. Pei, and Y. Yin, SIGMOD'00):
Depth-first search
Avoids explicit candidate generation
Major philosophy: grow long patterns from short ones using only locally frequent items (see the projection sketch below)
If "abc" is a frequent pattern, get all transactions containing "abc", i.e., project the DB on abc: DB|abc
If "d" is a locally frequent item in DB|abc, then abcd is a frequent pattern
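A tiny sketch of the projection idea with plain Python sets (illustrative names and toy data; this is not the FP-tree machinery itself):

    from collections import Counter

    def project(db, pattern):
        """DB|pattern: the transactions containing `pattern`, with the pattern's items removed."""
        pattern = set(pattern)
        return [t - pattern for t in db if pattern <= t]

    def local_frequent_items(db, pattern, min_sup):
        """Items d such that pattern ∪ {d} is frequent, found only in the projected database."""
        counts = Counter(item for t in project(db, pattern) for item in t)
        return {item for item, c in counts.items() if c >= min_sup}

    db = [{"a", "b", "c", "d"}, {"a", "b", "c"}, {"a", "b", "c", "d", "e"}, {"b", "e"}]  # toy data
    print(local_frequent_items(db, {"a", "b", "c"}, 2))   # {'d'} -> so {a, b, c, d} is frequent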
Construct FP-tree from a Transaction Database
[Figure: FP-tree constructed from the transaction database; the mining task is partitioned by frequent item — patterns containing p, …, down to the single pattern f.]
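Since the construction diagram did not survive extraction, here is a compact Python sketch of the standard two-scan construction (class and function names such as FPNode and build_fptree are illustrative): count item frequencies, keep frequent items in frequency-descending order (the f-list), then insert each transaction's sorted frequent items into a prefix tree with a header table of node-links.

    from collections import Counter, defaultdict

    class FPNode:
        def __init__(self, item, parent):
            self.item, self.parent, self.count = item, parent, 0
            self.children = {}                 # item -> FPNode

    def build_fptree(transactions, min_sup):
        # Scan 1: count item frequencies and build the f-list (frequency-descending order)
        freq = Counter(item for t in transactions for item in t)
        flist = [i for i, c in freq.most_common() if c >= min_sup]
        rank = {item: r for r, item in enumerate(flist)}

        root = FPNode(None, None)
        header = defaultdict(list)             # item -> node-links (all nodes holding that item)

        # Scan 2: insert each transaction's frequent items, sorted in f-list order
        for t in transactions:
            items = sorted((i for i in t if i in rank), key=rank.get)
            node = root
            for item in items:
                if item not in node.children:
                    child = FPNode(item, node)
                    node.children[item] = child
                    header[item].append(child)
                node = node.children[item]
                node.count += 1
        return root, header, flist

    # Toy usage:
    db = [{"f", "a", "c", "m", "p"}, {"f", "a", "c", "b", "m"}, {"f", "b"},
          {"c", "b", "p"}, {"f", "a", "c", "m", "p"}]
    root, header, flist = build_fptree(db, min_sup=3)
    print(flist)   # e.g. ['f', 'c', 'a', 'b', 'm', 'p'] (ties broken arbitrarily)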
Find Patterns Having P From P-conditional Database
Accumulate the count for each item in an item's conditional pattern base to identify the locally frequent items, then build that item's conditional FP-tree.
[Figure: the m-conditional FP-tree ({} – f:3 – c:3 – a:3); the cond. pattern base of "cm" is (f:3), yielding the cm-conditional FP-tree, and similarly the am-conditional FP-tree.]
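A short standalone sketch of this step (the m-conditional pattern base below is illustrative, not taken from the slide): given a conditional pattern base as prefix-path counts, accumulate per-item counts and keep the locally frequent items that form the conditional FP-tree.

    from collections import Counter

    def local_items_from_cond_base(cond_base, min_sup):
        """cond_base: {prefix_path (tuple): count}. Returns locally frequent items with counts."""
        counts = Counter()
        for path, cnt in cond_base.items():
            for item in path:
                counts[item] += cnt
        return {item: c for item, c in counts.items() if c >= min_sup}

    # Hypothetical m-conditional pattern base (prefix paths of "m" with their counts):
    m_base = {("f", "c", "a"): 2, ("f", "c", "a", "b"): 1}
    print(local_items_from_cond_base(m_base, min_sup=3))   # {'f': 3, 'c': 3, 'a': 3}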
A Special Case: Single Prefix Path in FP-tree
(If an FP-tree contains only a single path, mining is immediate: every sub-combination of the items on that path is a frequent pattern.)
Benefits of the FP-tree Structure
Completeness
Preserves complete information for frequent pattern mining
Never breaks a long pattern of any transaction
Compactness
Reduces irrelevant information—infrequent items are gone
Items are stored in frequency-descending order: the more frequently an item occurs, the more likely it is to be shared
The tree is never larger than the original database (not counting node-links and the count fields)
The Frequent Pattern Growth Mining Method
Idea: frequent pattern growth — recursively grow frequent patterns by pattern and database partition
Method:
For each frequent item, construct its conditional pattern base, and then its conditional FP-tree
Repeat the process on each newly created conditional FP-tree
Until the resulting FP-tree is empty, or it contains only one path — a single path generates all the combinations of its sub-paths, each of which is a frequent pattern
Performance of FPGrowth in Large Datasets
[Figure: run time (sec., 0–100) vs. support threshold (%, 0–3) on data set T25I20D10K, comparing D1 FP-growth and D1 Apriori.]
Advantages of the Pattern Growth Approach
Divide-and-conquer:
Decompose both the mining task and the DB according to the frequent patterns obtained so far
For huge TDBs, main memory may not be enough to hold the full FP-tree. The TDB can then be partitioned into a set of projected databases, one per frequent item; FP-growth is applied to each projection, and the patterns extracted from it are extended with the suffix item that defines the projection.
Basic Concepts
Evaluation Methods
Interestingness Measure: Correlations (Lift)
play basketball ⇒ eat cereal [40%, 66.7%] is misleading:
the overall percentage of students eating cereal is 75% > 66.7%.
play basketball ⇒ not eat cereal [20%, 33.3%] is more accurate,
although it has lower support and confidence.
Measure of dependent/correlated events — lift:
lift(A, B) = P(A ∪ B) / (P(A) P(B)); lift > 1 means A and B are positively correlated, lift < 1 negatively correlated, lift = 1 independent.
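A quick numeric check using the figures on this slide:

    # lift(basketball, cereal) = P(basketball and cereal) / (P(basketball) * P(cereal))
    #                          = confidence(basketball => cereal) / P(cereal)
    p_cereal = 0.75
    conf_basketball_cereal = 0.667
    lift = conf_basketball_cereal / p_cereal
    print(round(lift, 2))   # ~0.89 < 1: playing basketball and eating cereal are negatively correlated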
Are lift and χ² Good Measures of Correlation?
Which Null-Invariant Measure Is Better?
IR (Imbalance Ratio): measures the imbalance of the two itemsets A and B
in rule implications:
IR(A, B) = |sup(A) − sup(B)| / (sup(A) + sup(B) − sup(A ∪ B))
Kulczynski: Kulc(A, B) = ½ (P(A|B) + P(B|A))
Datasets D4 through D6 are all neutral (Kulc = 0.5) even though the individual
frequencies of m and c vary a lot. Since the Kulczynski value alone cannot
distinguish them, it is recommended to use the Imbalance Ratio (IR) together
with Kulczynski for extracting interesting patterns.
D4 is balanced & neutral (IR = 0), while D5 and D6 are increasingly imbalanced though still neutral.
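A minimal sketch of both measures computed from absolute support counts (function names and the toy counts are illustrative):

    def kulczynski(sup_a, sup_b, sup_ab):
        """Kulc(A, B) = 0.5 * (P(B|A) + P(A|B)), from absolute support counts."""
        return 0.5 * (sup_ab / sup_a + sup_ab / sup_b)

    def imbalance_ratio(sup_a, sup_b, sup_ab):
        """IR(A, B) = |sup(A) - sup(B)| / (sup(A) + sup(B) - sup(A ∪ B))."""
        return abs(sup_a - sup_b) / (sup_a + sup_b - sup_ab)

    # Toy counts: a balanced case vs. an imbalanced case with the same Kulczynski value of 0.5
    print(kulczynski(1000, 1000, 500), imbalance_ratio(1000, 1000, 500))    # 0.5 0.0   -> balanced & neutral
    print(kulczynski(3000, 1500, 1000), imbalance_ratio(3000, 1500, 1000))  # 0.5 ~0.43 -> imbalanced, still neutral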