Association
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
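To make the later sketches concrete, here is one possible Python representation of this transaction database (a hypothetical representation, not part of the original slides):

```python
# The five market-basket transactions above, keyed by TID.
transactions = {
    1: {"Bread", "Milk"},
    2: {"Bread", "Diaper", "Beer", "Eggs"},
    3: {"Milk", "Diaper", "Beer", "Coke"},
    4: {"Bread", "Milk", "Diaper", "Beer"},
    5: {"Bread", "Milk", "Diaper", "Coke"},
}
```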
Introduction
Transaction data is required to learn about the purchasing
behavior of customers.
Useful for marketing promotions, inventory management,
and customer relationship management.
Association analysis is useful for discovering interesting
relationships hidden in large data sets.
Relationships are represented as association rules or sets
of frequent items, e.g.
{Diapers} → {Beer}
The purchase of one product when another product is
purchased represents an association rule.
Market Basket Analysis
Rule form
Antecedent → Consequent [support, confidence]
(support and confidence are user defined measures of interestingness)
Given:
(1) database of transactions,
(2) each transaction is a list of items purchased by a
customer in a visit
Find:
all rules that correlate the presence of one set of items
(itemset) with that of another set of items
E.g., 35% of people who buy salmon also buy cheese
The model: data
Frequent Itemset
An itemset whose support is greater than or
equal to a minsup threshold
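A minimal sketch of this definition (the names and the 0.6 threshold are illustrative, assuming the five-transaction dataset above):

```python
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset):
    """Fraction of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

minsup = 0.6
# {Milk, Diaper} appears in transactions 3, 4 and 5 -> support 3/5 = 0.6,
# so it is a frequent itemset under this threshold.
print(support({"Milk", "Diaper"}) >= minsup)  # True
```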
Definition: Association Rule
An association rule is an implication expression of the form
X → Y, where X and Y are disjoint itemsets.
Support (s): the fraction of transactions that contain both X and Y.
Confidence (c): how often items in Y appear in transactions that contain X.
Rule   Support Calculation   Confidence Calculation
a.     3/5 = 0.6             3/4 = 0.75
b.     3/5 = 0.6             3/3 = 1
c.     1/5 = 0.2             1/2 = 0.5
d.     1/5 = 0.2             1/3 = 0.33
e.     1/5 = 0.2             1/1 = 1
f.     0                     0
Example
Why Support and Confidence
Support
is an important measure because a rule that has very low support may
occur simply by chance.
A low support rule is also likely to be uninteresting from a business
perspective because it may not be profitable to promote items that
customers seldom buy together.
For these reasons, support is often used to eliminate uninteresting
rules.
Confidence
measures the reliability of the inference made by a rule.
For a given rule X → Y, the higher the confidence, the more likely it
is for Y to be present in transactions that contain X.
Confidence also provides an estimate of the conditional probability of
Y given X.
Association Rule Mining Problem
Given a set of transactions T, find all rules having
support ≥ minsup and confidence ≥ minconf,
where minsup and minconf are the corresponding support and confidence
thresholds.
Brute-force approach:
List all possible association rules
Compute the support and confidence for each rule
Prune rules that fail the minsup and minconf thresholds
Computationally prohibitive!
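A brute-force sketch along these lines (illustrative Python with assumed minsup/minconf values; the exponential enumeration is exactly why this is prohibitive):

```python
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
items = sorted(set().union(*transactions))
N = len(transactions)
minsup, minconf = 0.4, 0.6  # assumed thresholds

def sup_count(s):
    return sum(1 for t in transactions if s <= t)

# List every itemset of size >= 2, then every binary split X -> Y of it.
for k in range(2, len(items) + 1):
    for itemset in combinations(items, k):
        full = sup_count(set(itemset))
        if full / N < minsup:
            continue  # rule support = support of X ∪ Y
        for r in range(1, k):
            for X in combinations(itemset, r):
                conf = full / sup_count(set(X))
                if conf >= minconf:
                    Y = set(itemset) - set(X)
                    print(set(X), "->", Y,
                          f"[sup={full/N:.2f}, conf={conf:.2f}]")
```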
Computational Complexity
Given d unique items:
Total number of itemsets = 2^d
Total number of possible association rules:
R = Σ_{k=1}^{d−1} [ C(d, k) × Σ_{j=1}^{d−k} C(d−k, j) ] = 3^d − 2^{d+1} + 1
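As a quick worked example of this formula: with d = 6 items, R = 3^6 − 2^7 + 1 = 729 − 128 + 1 = 602 possible rules, already far too many to inspect by hand.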
Observations:
• All the above rules are binary partitions of the same itemset:
{Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but
can have different confidence
• Thus, we may decouple the support and confidence requirements
Mining Association Rules
Two-step approach:
Frequent Itemset Generation
• Generate all itemsets whose support ≥ minsup.
• These itemsets are called frequent itemsets.
Rule Generation
• Generate high confidence rules from each frequent
itemset.
• These rules are called strong rules.
[Itemset lattice over items A, B, C, D, E: the 1-itemsets A … E, the
2-itemsets AB … DE, the 3-itemsets ABC … CDE, and so on up to the full
itemset ABCDE]
Frequent Itemset Generation
Brute-force approach:
Each itemset in the lattice is a candidate frequent itemset
Count the support of each candidate by scanning the database
[Figure: the N transactions of the TID table matched against the list
of M candidate itemsets; w is the maximum transaction width]
Match each transaction against every candidate.
If a candidate is contained in a transaction, its support count
is incremented.
Complexity ~ O(NMw) => expensive, since M = 2^d!
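A sketch of this brute-force counting loop (hypothetical Python; here d = 6 items gives M = 2^6 − 1 = 63 non-empty candidates):

```python
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
items = sorted(set().union(*transactions))  # d = 6 unique items

# Every non-empty itemset is a candidate: M = 2^d - 1.
candidates = [frozenset(c)
              for k in range(1, len(items) + 1)
              for c in combinations(items, k)]
print(len(candidates))  # 63

counts = dict.fromkeys(candidates, 0)
for t in transactions:        # N transactions
    for c in candidates:      # M candidates
        if c <= t:            # subset test costs ~O(w)
            counts[c] += 1    # increment support count

print(counts[frozenset({"Bread", "Milk"})])  # 3
```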
Frequent Itemset Generation Strategies
∀X, Y : (X ⊆ Y) ⇒ s(X) ≥ s(Y)
Support of an itemset never exceeds the support of its subsets
This is known as the anti-monotone property of support
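A tiny check of the anti-monotone property on the running example (illustrative; any pair X ⊆ Y would do):

```python
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(s):
    return sum(1 for t in transactions if s <= t) / len(transactions)

X = {"Milk"}                       # subset
Y = {"Milk", "Diaper", "Beer"}     # superset
print(support(X), support(Y))      # 0.8 0.4
assert support(X) >= support(Y)    # adding items can only shrink support
```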
Illustrating Apriori Principle
If an itemset is infrequent, then all of its supersets must also be
infrequent.
[Figure: itemset lattice from the null set through A … E up to ABCDE;
the itemset found to be infrequent and every superset of it are pruned]
Example
Step-1: K=1
(I) Create a table containing the support count of each item
present in the dataset, called C1 (candidate set).
Step-4:
Generate candidate set C4 using L3 (join step). The condition for
joining Lk-1 with Lk-1 (K = 4) is that they should have (K−2) elements
in common, so here, for L3, the first 2 elements (items) should match.
Check whether all subsets of these itemsets are frequent (here the
itemset formed by joining L3 is {I1, I2, I3, I5}, and its subset
{I1, I3, I5} is not frequent). So there is no itemset in C4.
We stop here because no further frequent itemsets are found.
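A sketch of this join-and-prune candidate generation (illustrative Python; `apriori_gen` is an assumed name, not from the slides):

```python
from itertools import combinations

def apriori_gen(L_prev, k):
    """Build candidate k-itemsets from the frequent (k-1)-itemsets L_prev.

    Join step: merge two sorted (k-1)-itemsets whose first k-2 items match.
    Prune step: drop any candidate with an infrequent (k-1)-subset.
    """
    prev = {tuple(sorted(s)) for s in L_prev}
    ordered = sorted(prev)
    candidates = []
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            if a[:-1] == b[:-1]:                       # join step
                cand = a + (b[-1],)
                if all(sub in prev                     # prune step
                       for sub in combinations(cand, k - 1)):
                    candidates.append(set(cand))
    return candidates

# As in the example: joining L3 = {I1,I2,I3}, {I1,I2,I5} gives {I1,I2,I3,I5},
# but its subset {I1,I3,I5} is not in L3, so the candidate is pruned.
L3 = [{"I1", "I2", "I3"}, {"I1", "I2", "I5"}]
print(apriori_gen(L3, 4))  # [] -> C4 is empty
```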
Example
We have discovered all the frequent itemsets.
Now the generation of strong association rules comes into the picture.
For that we need to calculate the confidence of each rule.
Confidence:
Confidence(A->B) = Support_count(A∪B) / Support_count(A)
A confidence of 60% means that 60% of the customers who
purchased milk and bread also bought butter.
Example
Itemset {I1, I2, I3} //from L3
So the candidate rules are:
[I1^I2]=>[I3] //confidence = sup(I1^I2^I3)/sup(I1^I2) =
2/4*100=50%
[I1^I3]=>[I2] //confidence = sup(I1^I2^I3)/sup(I1^I3) =
2/4*100=50%
[I2^I3]=>[I1] //confidence = sup(I1^I2^I3)/sup(I2^I3) =
2/4*100=50%
[I1]=>[I2^I3] //confidence = sup(I1^I2^I3)/sup(I1) = 2/6*100=33%
[I2]=>[I1^I3] //confidence = sup(I1^I2^I3)/sup(I2) = 2/7*100=28%
[I3]=>[I1^I2] //confidence = sup(I1^I2^I3)/sup(I3) = 2/6*100=33%
So if the minimum confidence is 50%, the first 3 rules can be
considered strong association rules.
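The same computation as a sketch, using the support counts quoted in the rules above (assumed to come from the example's dataset):

```python
from itertools import combinations

# Support counts quoted in the worked example above.
sup = {
    frozenset({"I1", "I2", "I3"}): 2,
    frozenset({"I1", "I2"}): 4,
    frozenset({"I1", "I3"}): 4,
    frozenset({"I2", "I3"}): 4,
    frozenset({"I1"}): 6,
    frozenset({"I2"}): 7,
    frozenset({"I3"}): 6,
}

itemset = frozenset({"I1", "I2", "I3"})
minconf = 0.5
# Every non-empty proper subset X is tried as antecedent: X => itemset - X.
for r in (2, 1):
    for X in map(frozenset, combinations(sorted(itemset), r)):
        conf = sup[itemset] / sup[X]   # sup(A∪B) / sup(A)
        verdict = "strong" if conf >= minconf else "weak"
        print(sorted(X), "=>", sorted(itemset - X),
              f"conf = {conf:.1%} ({verdict})")  # note: 2/7 ≈ 28.6%
```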
Illustrating Apriori Principle

TID  Items
1    Bread, Milk
2    Beer, Bread, Diaper, Eggs
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Bread, Coke, Diaper, Milk

Items (1-itemsets):

Item    Count
Bread   4
Coke    2
Milk    4
Beer    3
Diaper  4
Eggs    1

Minimum Support = 2
C3 itemset (3rd scan): {B, C, E}
L3: itemset {B, C, E}, support = 2
Is Apriori Fast Enough? Performance Bottlenecks
The core of the Apriori algorithm:
Use frequent (k – 1)-itemsets to generate candidate frequent
k-itemsets
Use database scan and pattern matching to collect counts for
the candidate itemsets
The bottleneck of Apriori: Candidate generation
Huge candidate sets
Multiple scans of database
Problems with association rule mining
Rare Item Problem: It assumes that all items in the data are
of the same nature and/or have similar frequencies.