Week 6 - Basic Association Analysis
Duc-Trong Le
Hanoi, 09/2021
Association Rule Mining
Market-Basket transactions
Example of Association Rules
{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}
Definition: Frequent Itemset
● Itemset
– A collection of one or more items
◆ Example: {Milk, Bread, Diaper}
– k-itemset
◆ An itemset that contains k items
● Support count (σ)
– Frequency of occurrence of an itemset
● Support
– Fraction of transactions that contain an itemset
● Frequent Itemset
– An itemset whose support is greater than or equal to a minsup threshold
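In symbols (standard definitions, stated here for completeness, with T = {t_1, …, t_N} the set of transactions and X an itemset):
σ(X) = |{ t_i ∈ T : X ⊆ t_i }|        (support count)
s(X) = σ(X) / N                        (support, as a fraction)
X is frequent  ⇔  s(X) ≥ minsup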
Association Rule Mining Task
● Given a set of transactions T, find all rules having support ≥ minsup and confidence ≥ minconf
● Brute-force approach:
– List all possible association rules
– Compute the support and confidence for each rule
– Prune rules that fail the minsup and minconf thresholds
⇒ Computationally prohibitive!
Computational Complexity
● Given d unique items:
– Total number of itemsets = 2^d
– Total number of possible association rules (see the count below)
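The total number of rules follows from the standard counting argument: choose k of the d items for the antecedent, then a non-empty subset of the remaining d − k items for the consequent:
R = Σ_{k=1..d−1} [ C(d,k) × Σ_{j=1..d−k} C(d−k, j) ] = 3^d − 2^(d+1) + 1
For d = 6, R = 3^6 − 2^7 + 1 = 602 rules.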
Mining Association Rules
Example of Rules:
{Milk,Diaper} → {Beer} (s=0.4, c=0.67)
{Milk,Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper,Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk,Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk,Beer} (s=0.4, c=0.5)
{Milk} → {Diaper,Beer} (s=0.4, c=0.5)
Observations:
● All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
● Rules originating from the same itemset have identical support but can have different confidence
● Thus, we may decouple the support and confidence requirements
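A minimal Python sketch that reproduces these numbers. The five market-basket transactions below are the usual textbook example these slides assume; since the transaction table itself is not reproduced above, treat the data (and the helper names) as illustrative.

transactions = [            # assumed example baskets (5 transactions)
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset):
    """Fraction of transactions containing every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent):
    """support(antecedent ∪ consequent) / support(antecedent)."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"Milk", "Diaper", "Beer"}))        # 0.4
print(confidence({"Milk", "Diaper"}, {"Beer"}))   # ~0.67
print(confidence({"Milk", "Beer"}, {"Diaper"}))   # 1.0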
Mining Association Rules
● Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup
2. Rule Generation
– Generate high confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset
Frequent Itemset Generation
Reducing Number of Candidates
● Apriori principle:
– If an itemset is frequent, then all of its subsets
must also be frequent
(Itemset lattice: once an itemset is found to be infrequent, all of its supersets are pruned)
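The principle rests on the anti-monotone property of support:
∀ X, Y :  X ⊆ Y  ⇒  s(X) ≥ s(Y)
i.e. the support of an itemset never exceeds the support of any of its subsets.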
Illustrating Apriori Principle
Minimum Support = 3
Items (1-itemsets) → Pairs (2-itemsets) → Triplets (3-itemsets)
(Tables of itemsets and their support counts not reproduced)
If every subset is considered:
C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41
With support-based pruning:
6 + 6 + 4 = 16, and 6 + 6 + 1 = 13 with the F_{k-1} × F_{k-1} candidate generation discussed below
Apriori Algorithm
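The pseudocode itself is not reproduced on this slide. Below is a minimal Python sketch of the level-wise loop, assuming transactions are sets of items and using the F_{k-1} × F_{k-1} candidate generation and pruning described on the next slides; all names are illustrative, not from any particular library.

from itertools import combinations

def generate_candidates(frequent_prev):
    """F_{k-1} x F_{k-1}: merge two frequent (k-1)-itemsets sharing their
    first k-2 items, then drop candidates that have an infrequent subset."""
    prev = sorted(frequent_prev)                 # itemsets as sorted tuples
    k = len(prev[0]) + 1 if prev else 0
    candidates = set()
    for i in range(len(prev)):
        for j in range(i + 1, len(prev)):
            a, b = prev[i], prev[j]
            if a[:-1] == b[:-1]:                 # identical (k-2)-item prefix
                cand = a + (b[-1],)
                # candidate pruning: every (k-1)-subset must be frequent
                if all(sub in frequent_prev
                       for sub in combinations(cand, k - 1)):
                    candidates.add(cand)
    return candidates

def apriori(transactions, minsup_count):
    """Return {itemset (sorted tuple): support count} for all frequent itemsets."""
    items = sorted({item for t in transactions for item in t})
    counts = {(i,): sum(1 for t in transactions if i in t) for i in items}
    frequent = {c: n for c, n in counts.items() if n >= minsup_count}
    result = dict(frequent)
    while frequent:
        candidates = generate_candidates(set(frequent))
        counts = {c: sum(1 for t in transactions if set(c) <= t)
                  for c in candidates}
        frequent = {c: n for c, n in counts.items() if n >= minsup_count}
        result.update(frequent)
    return result

Counting support by scanning every transaction for every candidate, as done here, is exactly the step the hash-tree structure on the later slides is designed to speed up.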
Candidate Generation: Merge F_{k-1} and F_1 Itemsets
Candidate Generation: F_{k-1} × F_{k-1} Method
● Merge two frequent (k−1)-itemsets if their first k−2 items are identical (see the sketch below)
● F_3 = {ABC, ABD, ABE, ACD, BCD, BDE, CDE}
– Merge(ABC, ABD) = ABCD
– Merge(ABC, ABE) = ABCE
– Merge(ABD, ABE) = ABDE
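A small sketch of just this merge step, assuming each frequent itemset is kept as a tuple of items in sorted order (helper name illustrative):

def merge_prefix(f_prev):
    """Merge pairs of (k-1)-itemsets that agree on their first k-2 items."""
    f_prev = sorted(f_prev)
    merged = []
    for i in range(len(f_prev)):
        for j in range(i + 1, len(f_prev)):
            a, b = f_prev[i], f_prev[j]
            if a[:-1] == b[:-1]:                 # same (k-2)-item prefix
                merged.append(a + (b[-1],))
    return merged

F3 = [tuple(s) for s in ("ABC", "ABD", "ABE", "ACD", "BCD", "BDE", "CDE")]
print(merge_prefix(F3))
# [('A','B','C','D'), ('A','B','C','E'), ('A','B','D','E')]  i.e. ABCD, ABCE, ABDE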
Candidate Pruning
● Let F_3 = {ABC, ABD, ABE, ACD, BCD, BDE, CDE} be the set of frequent 3-itemsets
● Candidate pruning (see the sketch below)
– Prune ABCE because ACE and BCE are infrequent
– Prune ABDE because ADE is infrequent
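And a sketch of the corresponding subset check (same tuple representation; names illustrative):

from itertools import combinations

def prune(candidates, f_prev):
    """Keep only candidates all of whose (k-1)-subsets are frequent."""
    f_prev = set(f_prev)
    return [c for c in candidates
            if all(sub in f_prev for sub in combinations(c, len(c) - 1))]

F3 = {tuple(s) for s in ("ABC", "ABD", "ABE", "ACD", "BCD", "BDE", "CDE")}
print(prune([tuple("ABCD"), tuple("ABCE"), tuple("ABDE")], F3))
# [('A','B','C','D')]  -- ABCE and ABDE are pruned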
Illustrating Apriori Principle
Minimum Support = 3
Items (1-itemsets) → Pairs (2-itemsets) → Triplets (3-itemsets)
If every subset is considered:
C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41
With support-based pruning:
6 + 6 + 1 = 13
Use of the F_{k-1} × F_{k-1} method for candidate generation results in only one 3-itemset candidate, which is eliminated after the support-counting step.
Alternate F_{k-1} × F_{k-1} Method
● Merge two frequent (k−1)-itemsets if the last k−2 items of the first are identical to the first k−2 items of the second
● F_3 = {ABC, ABD, ABE, ACD, BCD, BDE, CDE}
– Merge(ABC, BCD) = ABCD
– Merge(ABD, BDE) = ABDE
– Merge(ACD, CDE) = ACDE
– Merge(BCD, CDE) = BCDE
Candidate Pruning for Alternate F_{k-1} × F_{k-1} Method
● Let F_3 = {ABC, ABD, ABE, ACD, BCD, BDE, CDE} be the set of frequent 3-itemsets
● Candidate pruning
– Retain ABCD: all of its 3-item subsets (ABC, ABD, ACD, BCD) are frequent
– Prune ABDE because ADE is infrequent
– Prune ACDE because ACE is infrequent
– Prune BCDE because BCE is infrequent
Support Counting of Candidate Itemsets
Support Counting: An Example
Suppose you have 15 candidate itemsets of length 3:
{1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}
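The slides organise these candidates in a hash tree for fast matching; as a simpler baseline, the sketch below enumerates every 3-item subset of a transaction and looks it up in the candidate set (the example transaction is illustrative):

from itertools import combinations

candidates = {
    (1, 4, 5), (1, 2, 4), (4, 5, 7), (1, 2, 5), (4, 5, 8), (1, 5, 9),
    (1, 3, 6), (2, 3, 4), (5, 6, 7), (3, 4, 5), (3, 5, 6), (3, 5, 7),
    (6, 8, 9), (3, 6, 7), (3, 6, 8),
}

def count_support(transactions, candidates, k=3):
    """Increment the count of every candidate contained in each transaction."""
    counts = dict.fromkeys(candidates, 0)
    for t in transactions:
        for subset in combinations(sorted(t), k):    # all k-item subsets of t
            if subset in counts:
                counts[subset] += 1
    return counts

# The transaction {1, 2, 3, 5, 6} contains candidates (1,2,5), (1,3,6), (3,5,6).
counts = count_support([{1, 2, 3, 5, 6}], candidates)
print([c for c, n in counts.items() if n > 0])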
Rule Generation
(Lattice of rules generated from one frequent itemset: once a rule is found to have low confidence, all rules obtained by moving further items from its antecedent into its consequent are pruned)
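A minimal sketch of rule generation from a single frequent itemset. It checks every binary partition against minconf directly; the lattice-based pruning sketched in the figure (skipping all descendants of a low-confidence rule) is omitted for brevity. The support counts in the example are the ones implied by the Milk/Diaper/Beer rules listed earlier; names are illustrative.

from itertools import combinations

def rules_from_itemset(itemset, support_count, minconf):
    """Yield (antecedent, consequent, confidence) for one frequent itemset."""
    itemset = tuple(sorted(itemset))
    for r in range(len(itemset) - 1, 0, -1):              # antecedent size
        for antecedent in combinations(itemset, r):
            consequent = tuple(x for x in itemset if x not in antecedent)
            conf = support_count[itemset] / support_count[antecedent]
            if conf >= minconf:
                yield antecedent, consequent, conf

# Support counts implied by the earlier example (5 transactions, s = 0.4):
sc = {('Beer', 'Diaper', 'Milk'): 2, ('Diaper', 'Milk'): 3, ('Beer', 'Milk'): 2,
      ('Beer', 'Diaper'): 3, ('Milk',): 4, ('Diaper',): 4, ('Beer',): 3}
for ant, con, conf in rules_from_itemset(('Beer', 'Diaper', 'Milk'), sc, 0.6):
    print(ant, '->', con, round(conf, 2))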
Association Analysis: Basic Concepts and Algorithms
Factors Affecting Complexity of Apriori
● Size of database
– Apriori makes multiple passes over the data, so run time grows with the number of transactions
● Average transaction width
– Wider transactions increase the maximum size of frequent itemsets and the number of subsets that must be examined during support counting
Impact of Support-Based Pruning
(Item support-count table not reproduced)
Minimum Support = 3
– If every subset is considered: C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41
– With support-based pruning: 6 + 6 + 4 = 16
Minimum Support = 2
– If every subset is considered: C(6,1) + C(6,2) + C(6,3) + C(6,4) = 6 + 15 + 20 + 15 = 56
Compact Representation of Frequent Itemsets
Maximal Frequent Itemset
An itemset is maximal frequent if it is frequent and none of its immediate supersets is frequent.
(Itemset lattice: the border separates frequent from infrequent itemsets; the maximal frequent itemsets lie immediately inside the border, on the frequent side)
What are the Maximal Frequent Itemsets in this Data?
(Data set with three groups of items, A1–A10, B1–B10, C1–C10; transaction table not reproduced)
An illustrative example
(Transaction table over items A–J and transactions 1–10 not reproduced)
At a higher support threshold, the maximal itemsets are {E,F} and {J}
Support threshold (by count): 3
– Frequent itemsets: all subsets of {C,D,E,F}, plus {J}
– Maximal itemsets: {C,D,E,F}, {J}
Another illustrative example
Closed Itemset
● An itemset X is closed if none of its immediate supersets has the same support as X
● Every maximal frequent itemset is closed, but a closed frequent itemset need not be maximal
Maximal Frequent vs Closed Frequent Itemsets
(Itemset lattice with closed and maximal frequent itemsets marked)
# Closed frequent = 9
# Maximal frequent = 4
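A sketch that labels frequent itemsets as maximal and/or closed, given a mapping from each frequent itemset (as a frozenset) to its support count (names illustrative):

def maximal_and_closed(frequent):
    """frequent: {frozenset(itemset): support count} for all frequent itemsets."""
    maximal, closed = set(), set()
    for itemset, sup in frequent.items():
        supersets = [s for s in frequent
                     if itemset < s and len(s) == len(itemset) + 1]
        if not supersets:                    # no frequent immediate superset
            maximal.add(itemset)
        # An immediate superset with equal support would itself be frequent,
        # so it suffices to compare against the frequent immediate supersets.
        if all(frequent[s] < sup for s in supersets):
            closed.add(itemset)
    return maximal, closed

Every maximal frequent itemset is also closed, so the maximal set is always contained in the closed set, consistent with the counts above (4 maximal vs 9 closed).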
What are the Closed Itemsets in this Data?
(Same data set as before: item groups A1–A10, B1–B10, C1–C10)
Maximal vs Closed Itemsets
● Maximal frequent itemsets ⊆ closed frequent itemsets ⊆ frequent itemsets
Pattern Evaluation
Computing Interestingness Measure
● Contingency table for the rule Tea → Coffee:

            Coffee   No Coffee   Total
Tea            150          50     200
No Tea         650         150     800
Total          800         200    1000
Drawback of Confidence
Customer   Tea   Honey   …
C1           0       1   …
C2           1       0   …
C3           1       1   …
C4           1       0   …
…
● The criterion
confidence(X → Y) = support(Y)
is equivalent to:
– P(Y|X) = P(Y)
– P(X,Y) = P(X) × P(Y)  (X and Y are independent)
● If P(X,Y) < P(X) × P(Y), X and Y are negatively correlated; if P(X,Y) > P(X) × P(Y), they are positively correlated

Measures that take into account statistical dependence
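One such measure, used on the next slide, is lift (also called interest):
Lift(X → Y) = P(Y | X) / P(Y) = P(X, Y) / ( P(X) × P(Y) )
Lift = 1 → X and Y are independent;  Lift > 1 → positive association;  Lift < 1 → negative association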
Example: Lift/Interest
Association rule: Tea → Coffee

            Coffee   No Coffee   Total
Tea            150          50     200
No Tea         650         150     800
Total          800         200    1000
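Worked through on the contingency table above:
confidence(Tea → Coffee) = 150 / 200  = 0.75
P(Coffee)                = 800 / 1000 = 0.8
Lift(Tea → Coffee)       = 0.75 / 0.8 = 0.9375 < 1
So despite the seemingly high confidence, tea drinkers are slightly less likely than average to drink coffee: a negative association that confidence alone hides.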
Simpson’s Paradox
● The relationship observed between two variables can weaken or even reverse direction when data are aggregated over (or split by) a hidden, confounding variable, so associations should be checked within strata as well as overall.