Association Rule Mining
Chapter 4
Association Rule Mining
• Basic Concepts
• Frequent Pattern and Association Rule Mining
• Association Rule Evaluation
• Issues in Association Rule Mining
• Classification of Frequent Pattern Mining
• Mining Frequent Itemsets
• The Apriori Algorithm
• Multi-Level Association Rules
• Multi-Dimensional Association Rule Mining
What Is Frequent Pattern Analysis?
• Frequent pattern: a pattern (a set of items, subsequences,
substructures, etc.) that occurs frequently in a dataset
• First proposed by Agrawal et al. [1] in the context of frequent
itemsets and association rule mining
• Motivation: Finding inherent regularities in data
– What products were often purchased together?— Beer and
diapers?!
– What are the subsequent purchases after buying a PC?
– What kinds of DNA are sensitive to this new drug?
– Can we automatically classify web documents?
[1] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In Proc. 1993 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD'93), pp. 207–216, Washington, DC, May 1993.
What Is Frequent Pattern Analysis?
• Applications
– Basket data analysis, cross-marketing, catalog design, sale campaign
analysis, Web log (click stream) analysis, and DNA sequence analysis.
Basic Concepts: Frequent Patterns

Tid  Items bought
10   Beer, Nuts, Diaper
20   Beer, Coffee, Diaper
30   Beer, Diaper, Eggs
40   Nuts, Eggs, Milk
50   Nuts, Coffee, Diaper, Eggs, Milk

• itemset: a set of one or more items
• k-itemset: X = {x1, …, xk}
• (absolute) support, or support count, of X: the frequency or number of occurrences of the itemset X
• (relative) support, s: the fraction of transactions that contain X (i.e., the probability that a transaction contains X)
• An itemset X is frequent if X's support is no less than a minsup threshold

[Figure: Venn diagram of customers buying beer, customers buying diapers, and customers buying both]
Basic Concepts: Association Rules

Tid  Items bought
10   Beer, Nuts, Diaper
20   Beer, Coffee, Diaper
30   Beer, Diaper, Eggs
40   Nuts, Eggs, Milk
50   Nuts, Coffee, Diaper, Eggs, Milk

• Find all the rules X → Y with minimum support and confidence
  – support, s: probability that a transaction contains X ∪ Y
  – confidence, c: conditional probability that a transaction having X also contains Y

Let minsup = 50%, minconf = 50%
Frequent patterns: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3
Association rules (many more exist):
  Beer → Diaper (support 60%, confidence 100%)
  Diaper → Beer (support 60%, confidence 75%)

[Figure: Venn diagram of customers buying beer, diapers, or both]
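To make the two measures concrete, here is a minimal Python sketch (my own illustration, not part of the slides; the helper names are hypothetical) that computes support and confidence on the toy database above:

```python
# Toy transaction database from the slide (one set per transaction).
transactions = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """Conditional probability that a transaction with `lhs` also has `rhs`."""
    return support(set(lhs) | set(rhs)) / support(lhs)

print(support({"Beer", "Diaper"}))        # 0.6  -> s = 60%
print(confidence({"Beer"}, {"Diaper"}))   # 1.0  -> c = 100%
print(confidence({"Diaper"}, {"Beer"}))   # 0.75 -> c = 75%
```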
Association Rule Mining Task
• Given a set of transactions T, the goal of
association rule mining is to find all rules having
– support ≥ minsup threshold
– confidence ≥ minconf threshold
• Brute-force approach:
– List all possible association rules
– Compute the support and confidence for each rule
– Prune rules that fail the minsup and minconf
thresholds
– Computationally prohibitive! (with d items there are 3^d - 2^(d+1) + 1 possible rules)
Mining Association Rules

Tid  Items bought
10   Bread, Milk
20   Bread, Diaper, Beer, Eggs
30   Milk, Diaper, Beer, Coke
40   Bread, Milk, Diaper, Beer
50   Bread, Milk, Diaper, Coke

Example rules:
{Milk, Diaper} → {Beer}   (s=0.4, c=0.67)
{Milk, Beer} → {Diaper}   (s=0.4, c=1.0)
{Diaper, Beer} → {Milk}   (s=0.4, c=0.67)
{Beer} → {Milk, Diaper}   (s=0.4, c=0.67)
{Diaper} → {Milk, Beer}   (s=0.4, c=0.5)
{Milk} → {Diaper, Beer}   (s=0.4, c=0.5)

Observations:
• All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but can have different confidence
• Thus, we may decouple the support and confidence requirements
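The observation can be verified mechanically: every non-empty proper subset of the itemset yields one rule. A small sketch (mine, not from the slides) that reproduces the six rules above:

```python
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset):
    return sum(set(itemset) <= t for t in transactions) / len(transactions)

itemset = {"Milk", "Diaper", "Beer"}
s = support(itemset)   # identical for every rule below: 0.4

# Each non-empty proper subset X gives one rule X -> (itemset - X).
for r in range(1, len(itemset)):
    for lhs in combinations(sorted(itemset), r):
        rhs = itemset - set(lhs)
        c = s / support(lhs)            # confidence differs per partition
        print(f"{set(lhs)} -> {rhs}  (s={s}, c={c:.2f})")
```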
Mining Association Rules
• Two-step approach:
1. Frequent Itemset Generation
• Generate all itemsets whose support ≥ minsup
2. Rule Generation
• Generate high confidence rules from each frequent
itemset, where each rule is a binary partitioning of a
frequent itemset
• Frequent itemset generation is still
computationally expensive
Frequent Itemset Generation

[Figure: the lattice of all itemsets over d items; there are 2^d possible candidate itemsets]
Frequent Itemset Generation

• Brute-force approach:
  – Each itemset in the lattice is a candidate frequent itemset
  – Count the support of each candidate by scanning the database

Tid  Items bought
10   Bread, Milk
20   Bread, Diaper, Beer, Eggs
30   Milk, Diaper, Beer, Coke
40   Bread, Milk, Diaper, Beer
50   Bread, Milk, Diaper, Coke

  – Match each of the N transactions against every one of the M = 2^d candidates
  – Complexity: O(NMw), where w is the maximum transaction width: this is costly
Frequent Itemset Generation Strategies
• Reduce the number of candidates (M)
– Complete search: M = 2^d
– Use pruning techniques to reduce M
• Reduce the number of transactions (N)
– Reduce size of N as the size of itemset increases
– Used by DHP and vertical-based mining algorithms
• Reduce the number of comparisons (NM)
– Use efficient data structures to store the candidates
or transactions
– No need to match every candidate against every
transaction
Reducing Number of Candidates

• Apriori principle:
  – If an itemset is frequent, then all of its subsets must also be frequent
• The Apriori principle holds due to the following anti-monotone property of the support measure:

  ∀X, Y: (X ⊆ Y) ⇒ s(X) ≥ s(Y)

  – i.e., the support of an itemset never exceeds the support of any of its subsets
The Apriori Algorithm: An Example

Supmin = 2

Database TDB:
Tid  Items
10   A, C, D
20   B, C, E
30   A, B, C, E
40   B, E

1st scan → C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
L1 (prune {D}, support < 2): {A}:2, {B}:3, {C}:3, {E}:3

C2 (generated from L1): {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
2nd scan → C2 counts: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2

C3 (generated from L2): {B,C,E}
3rd scan → L3: {B,C,E}:2
The Apriori Algorithm (Pseudo-Code)

Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent items};
for (k = 1; Lk ≠ ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
Implementation of Apriori
• How to generate candidates?
– Step 1: self-joining Lk
– Step 2: pruning
• Example of Candidate-generation
– L3={abc, abd, acd, ace, bcd}
– Self-joining: L3*L3
• abcd from abc and abd
• acde from acd and ace
– Pruning:
• acde is removed because ade is not in L3
– C4 = {abcd}
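The join and prune steps map directly to code. A sketch (my own, assuming itemsets are kept as sorted tuples) that reproduces the L3 example above:

```python
from itertools import combinations

def gen_candidates(Lk):
    """Self-join Lk with itself, then Apriori-prune the result."""
    k = len(Lk[0])
    Lset = set(Lk)
    cands = []
    for a, b in combinations(sorted(Lk), 2):
        if a[:-1] == b[:-1]:                 # join: first k-1 items agree
            c = a + (b[-1],)
            # prune: every k-subset of the candidate must be frequent
            if all(s in Lset for s in combinations(c, k)):
                cands.append(c)
    return cands

L3 = [tuple("abc"), tuple("abd"), tuple("acd"), tuple("ace"), tuple("bcd")]
print(gen_candidates(L3))   # [('a','b','c','d')]; acde is pruned (ade not in L3)
```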
Reducing Number of Comparisons
• Candidate counting:
– Scan the database of transactions to determine
the support of each candidate itemset
– To reduce the number of comparisons, store the
candidates in a hash structure
• Instead of matching each transaction against every
candidate, match it against candidates contained in the
hashed buckets
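A simplified sketch of the idea: the hash tree of the next slides buckets candidates by hashing successive items, but even a flat hash set of candidates, probed with each transaction's k-subsets, already avoids matching every candidate against every transaction (the names here are illustrative, not from the slides):

```python
from itertools import combinations

# Candidates stored in a hash structure: O(1) expected lookup per probe.
candidates = {frozenset(c) for c in [("Beer", "Diaper"), ("Milk", "Diaper")]}
k = 2
counts = {c: 0 for c in candidates}

transactions = [{"Beer", "Nuts", "Diaper"}, {"Milk", "Diaper", "Bread"}]
for t in transactions:
    for subset in combinations(sorted(t), k):   # all k-subsets of t
        fs = frozenset(subset)
        if fs in counts:                        # probe the hash structure
            counts[fs] += 1
print(counts)
```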
How to Count Supports of Candidates?

[Figure: hash tree over the candidate 3-itemsets; at the root, the 1st item is hashed, so items 1, 4 or 7 go to the left subtree]

Association Rule Discovery: Hash Tree

Hash function: items 1, 4, 7 → left branch; 2, 5, 8 → middle branch; 3, 6, 9 → right branch

[Figure: the same hash tree; at depth 2 the 2nd item is hashed (2, 5 or 8 → middle), at depth 3 the 3rd item (3, 6 or 9 → right); leaves store the candidate itemsets themselves, e.g. {2,3,4} and {5,6,7}]
Subset Operation

Given a transaction t = {1, 2, 3, 5, 6}, what are the possible subsets of size 3?

Level 1 (fix the 1st item):  1 | 2 3 5 6,   2 | 3 5 6,   3 | 5 6
Level 2 (fix the first two): 12 | 3 5 6,  13 | 5 6,  15 | 6,  23 | 5 6,  25 | 6,  35 | 6
Level 3: each remaining item completes one subset of 3 items
Subset Operation Using the Hash Tree

Transaction: 1 2 3 5 6
At the root, hash on the 1st item:  1 + 2 3 5 6,  2 + 3 5 6,  3 + 5 6
At level 2, hash on the 2nd item:  1 2 + 3 5 6,  1 3 + 5 6,  1 5 + 6,  2 3 + 5 6,  2 5 + 6,  3 5 + 6
Each path ends in a bucket; only the candidates in that bucket are matched against the transaction
Maximal Frequent Itemset

• An itemset is maximal frequent if it is frequent and none of its immediate supersets is frequent
Closed Itemset

• An itemset is closed if none of its immediate supersets has the same support as the itemset

TID  Items
1    {A,B}
2    {B,C,D}
3    {A,B,C,D}
4    {A,B,D}
5    {A,B,C,D}

Itemset  Support    Itemset    Support
{A}      4          {A,B,C}    2
{B}      5          {A,B,D}    3
{C}      3          {A,C,D}    2
{D}      4          {B,C,D}    3
{A,B}    4          {A,B,C,D}  2
{A,C}    2
{A,D}    3
{B,C}    3
{B,D}    4
{C,D}    3
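A small sketch that classifies the itemsets in the table above as closed and/or maximal from their supports. Note that the minsup = 3 used for "maximal" is my assumption, since the slide does not fix a threshold:

```python
# Supports copied from the table above.
supports = {
    frozenset("A"): 4, frozenset("B"): 5, frozenset("C"): 3, frozenset("D"): 4,
    frozenset("AB"): 4, frozenset("AC"): 2, frozenset("AD"): 3,
    frozenset("BC"): 3, frozenset("BD"): 4, frozenset("CD"): 3,
    frozenset("ABC"): 2, frozenset("ABD"): 3, frozenset("ACD"): 2,
    frozenset("BCD"): 3, frozenset("ABCD"): 2,
}
minsup = 3   # assumed threshold, not from the slide

def immediate_supersets(x):
    return [y for y in supports if len(y) == len(x) + 1 and x < y]

for x, s in supports.items():
    # closed: no immediate superset has the same support
    closed = all(supports[y] != s for y in immediate_supersets(x))
    # maximal: frequent, and no immediate superset is frequent
    maximal = s >= minsup and all(supports[y] < minsup for y in immediate_supersets(x))
    if closed or maximal:
        print("".join(sorted(x)), s,
              "closed" if closed else "", "maximal" if maximal else "")
```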
Maximal vs Closed Itemsets

[Figure: itemset lattice illustrating closed and maximal frequent itemsets; every maximal frequent itemset is also closed, but not vice versa]
The Frequent Pattern Growth Mining Method
FP-growth Algorithm
• Use a compressed representation of the
database using an FP-tree
• Once an FP-tree has been constructed, it uses
a recursive divide-and-conquer approach to
mine the frequent itemsets
FP-tree Construction

TID  Items
1    {A,B}
2    {B,C,D}
3    {A,C,D,E}
4    {A,D,E}
5    {A,B,C}
6    {A,B,C,D}
7    {B,C}
8    {A,B,C}
9    {A,B,D}
10   {B,C,E}

Item supports: B: 8, A: 7, C: 7, D: 5, E: 3

FP-tree after reading all ten transactions (items inserted in lexicographic order; shared prefixes share nodes):

null
├── A:7
│   ├── B:5
│   │   ├── C:3
│   │   │   └── D:1
│   │   └── D:1
│   ├── C:1
│   │   └── D:1
│   │       └── E:1
│   └── D:1
│       └── E:1
└── B:3
    └── C:3
        ├── D:1
        └── E:1
FP-tree Construction

[Figure: the same FP-tree with its header table (items E, D, C, A, B; supports B: 8, A: 7, C: 7, D: 5, E: 3) and node-link pointers chaining together all nodes that carry the same item, so every occurrence of an item can be reached from the header table]
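A sketch of the construction (my own illustration; the slide's figure inserts items in lexicographic order, so the code does too, although classic FP-growth orders items by descending frequency):

```python
from collections import Counter, defaultdict

class Node:
    def __init__(self, item, parent):
        self.item, self.parent, self.count, self.children = item, parent, 0, {}

def build_fptree(transactions, minsup=1):
    """One pass counts item frequencies; a second pass inserts each
    transaction, so shared prefixes share (and re-count) tree nodes."""
    freq = Counter(i for t in transactions for i in t)
    freq = {i: n for i, n in freq.items() if n >= minsup}
    root, header = Node(None, None), defaultdict(list)  # header: item -> node links
    for t in transactions:
        node = root
        # Lexicographic order to match the slide's figure; classic FP-growth
        # would sort by descending frequency instead.
        for i in sorted(i for i in t if i in freq):
            if i not in node.children:
                node.children[i] = Node(i, node)
                header[i].append(node.children[i])
            node = node.children[i]
            node.count += 1
    return root, header

tdb = [{"A","B"}, {"B","C","D"}, {"A","C","D","E"}, {"A","D","E"}, {"A","B","C"},
       {"A","B","C","D"}, {"B","C"}, {"A","B","C"}, {"A","B","D"}, {"B","C","E"}]
root, header = build_fptree(tdb)
print({c.item: c.count for c in root.children.values()})   # {'A': 7, 'B': 3}
```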
Benefits of the FP-tree Structure

• Completeness:
  – never breaks a long pattern of any transaction
  – preserves complete information for frequent pattern mining
• Compactness:
  – removes irrelevant information: infrequent items are gone
  – frequency-descending ordering: more frequent items are more likely to be shared
  – never larger than the original database (not counting node-links and counts)
Mining Frequent Patterns Using FP-tree
• General idea (divide-and-conquer)
– Recursively grow frequent pattern path using the
FP-tree
• Method
– For each item, construct its conditional pattern base, and then its conditional FP-tree
– Repeat the process on each newly created
conditional FP-tree
– Until the resulting FP-tree is empty, or it contains
only one path (single path will generate all the combinations of
its sub-paths, each of which is a frequent pattern)
Major Steps to Mine FP-tree

Header table: f: 4, c: 4, a: 3, b: 3, m: 3, p: 3 (each entry heads a chain of node links into the tree)

Global FP-tree:
{}
├── f:4
│   ├── c:3
│   │   └── a:3
│   │       ├── m:2
│   │       │   └── p:2
│   │       └── b:1
│   │           └── m:1
│   └── b:1
└── c:1
    └── b:1
        └── p:1

m-conditional pattern base: fca: 2, fcab: 1
m-conditional FP-tree (a single path): {} → f:3 → c:3 → a:3
All frequent patterns concerning m: m, fm, cm, am, fcm, fam, cam, fcam
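A sketch of how a conditional pattern base is collected: for each transaction containing the item, take the prefix that precedes it. The transaction list below is an assumption, reconstructed from Han's textbook example so that it matches the header table and the m-conditional pattern base shown above:

```python
from collections import Counter

# Transactions already pruned to frequent items and sorted in frequency
# order (f, c, a, b, m, p) -- reconstructed, not given on the slide.
ordered_tdb = [
    ["f", "c", "a", "m", "p"],
    ["f", "c", "a", "b", "m"],
    ["f", "b"],
    ["c", "b", "p"],
    ["f", "c", "a", "m", "p"],
]

def conditional_pattern_base(item):
    """Prefix paths that precede `item`, with their occurrence counts."""
    base = Counter()
    for t in ordered_tdb:
        if item in t:
            prefix = tuple(t[: t.index(item)])
            if prefix:
                base[prefix] += 1
    return base

print(conditional_pattern_base("m"))   # {('f','c','a'): 2, ('f','c','a','b'): 1}
```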
Mining Frequent Patterns by Creating Conditional Pattern-Bases

Item  Conditional pattern base  Conditional FP-tree
c     f:3                       {f:3} | c
a     fc:3                      {f:3, c:3} | a
b     fca:1, f:1, c:1           empty
m     fca:2, fcab:1             {f:3, c:3, a:3} | m
p     fcam:2, cb:1              {c:3} | p
Step 3: Recursively Mine the Conditional FP-trees

m-conditional FP-tree: {} → f:3 → c:3 → a:3

Cond. pattern base of "am": (fc:3) → am-conditional FP-tree: {} → f:3 → c:3
Cond. pattern base of "cm": (f:3) → cm-conditional FP-tree: {} → f:3
Cond. pattern base of "cam": (f:3) → cam-conditional FP-tree: {} → f:3
Single FP-tree Path Generation

• If an FP-tree consists of a single path, all its frequent patterns are generated by enumerating the combinations of its sub-paths
• The m-conditional FP-tree {} → f:3 → c:3 → a:3 yields: m, fm, cm, am, fcm, fam, cam, fcam
Principles of Frequent Pattern Growth

• Pattern growth property
  – Let α be a frequent itemset in DB, B be α's conditional pattern base, and β be an itemset in B. Then α ∪ β is a frequent itemset in DB iff β is frequent in B.
• "abcdef" is a frequent pattern, if and only if
  – "abcde" is a frequent pattern, and
  – "f" is frequent in the set of transactions containing "abcde"
FP-Growth vs. Apriori: Scalability with the Support Threshold

[Figure: run time (sec., 0–70) as a function of the support threshold (0–3%); run time rises sharply as the threshold decreases, with FP-growth scaling far better than Apriori]
FP-Growth vs. Tree-Projection: Scalability with the Support Threshold

[Figure: runtime (sec., 0–100) as a function of the support threshold (0–2%) for FP-growth and TreeProjection]
Advantages of the Pattern Growth Approach

• Divide-and-conquer:
  – Decomposes both the mining task and the DB according to the frequent patterns obtained so far
  – Leads to a focused search of smaller databases
• Other factors:
  – No candidate generation, no candidate test
  – Compressed database: the FP-tree structure
  – No repeated scan of the entire database
  – Basic operations are counting local frequent items and building sub-FP-trees; no pattern search and matching
• A good open-source implementation and refinement of FP-growth:
  – FPGrowth+ (G. Grahne and J. Zhu, FIMI'03)
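For quick experimentation, third-party implementations also exist. For example, assuming the mlxtend package is installed (it is not mentioned in the slides), FP-growth plus rule generation on the toy data looks like this:

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth, association_rules

tdb = [["Beer", "Nuts", "Diaper"], ["Beer", "Coffee", "Diaper"],
       ["Beer", "Diaper", "Eggs"], ["Nuts", "Eggs", "Milk"],
       ["Nuts", "Coffee", "Diaper", "Eggs", "Milk"]]

# One-hot encode the transactions into the boolean DataFrame mlxtend expects.
te = TransactionEncoder()
df = pd.DataFrame(te.fit(tdb).transform(tdb), columns=te.columns_)

freq = fpgrowth(df, min_support=0.5, use_colnames=True)      # frequent itemsets
rules = association_rules(freq, metric="confidence", min_threshold=0.5)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```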
ECLAT: Mining by Exploring Vertical Data Format

• Vertical data format: each item is stored with its tid-list, the set of transaction ids that contain it
• The support of an itemset is the cardinality of the intersection of its items' tid-lists; larger itemsets are grown by intersecting the tid-lists of smaller ones
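A minimal ECLAT-style sketch (illustrative names, depth-first growth with tid-list intersection) on the TDB of the Apriori example slide; it reports the same frequent itemsets:

```python
transactions = {10: {"A", "C", "D"}, 20: {"B", "C", "E"},
                30: {"A", "B", "C", "E"}, 40: {"B", "E"}}
minsup = 2

# Vertical format: item -> set of transaction ids containing it.
tidlists = {}
for tid, items in transactions.items():
    for i in items:
        tidlists.setdefault(i, set()).add(tid)

def grow(prefix, prefix_tids, items):
    """Depth-first growth: extend `prefix` with each remaining item whose
    tid-list intersection still meets minsup."""
    for idx, (i, tids) in enumerate(items):
        new_tids = prefix_tids & tids if prefix else tids
        if len(new_tids) >= minsup:
            print(prefix + [i], len(new_tids))
            grow(prefix + [i], new_tids, items[idx + 1:])

grow([], set(), sorted(tidlists.items()))
```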
Mining Frequent Closed Patterns: CLOSET

• CLOSET (Pei, Han & Mao, DMKD'00): a frequent-pattern-growth method that mines closed itemsets directly using an FP-tree
Interestingness Measure: Correlations (Lift)

• Support and confidence alone can be misleading when the consequent is itself very frequent
• lift(A, B) = P(A ∪ B) / (P(A) P(B)) = c(A → B) / s(B)
• lift = 1: A and B are independent; lift > 1: positively correlated; lift < 1: negatively correlated
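A worked check using the five-transaction database from the Basic Concepts slides (the numbers follow directly from that table): for Beer → Diaper, s(Beer ∪ Diaper) = 3/5 = 0.6, s(Beer) = 0.6, and s(Diaper) = 0.8, so lift = 0.6 / (0.6 × 0.8) = 1.25 > 1, i.e., Beer and Diaper are positively correlated, and the rule carries information beyond the fact that 80% of all transactions contain Diaper anyway.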