Mining Association Rules
Motivation
Discovering relations among transactional data
Example – market basket analysis
Discovery of the buying habits of customers: what items are frequently purchased by a customer in a single trip?
Helps develop marketing strategies
Issues:
How to formulate association rules
How to determine interesting association rules
How to discover interesting association rules efficiently in a large data set?
Formulating Association Rules
Example: "a customer that purchases coffee tends to also buy sugar" is represented as:
coffee => sugar [support = 10%, confidence = 70%]
support = 10%: 10% of all customers purchase both coffee and sugar
confidence = 70%: 70% of the customers who buy coffee also buy sugar
Thresholds: support must be at least r, confidence at least c
Users set the thresholds to indicate interestingness

Example transaction set:
1 coffee, bread
2 coffee, meat, apple
3 coffee, sugar, noodle, salt
4 coffee, sugar, orange, potato
5 coffee, sugar, tomato
6 bread, sugar, bean
7 milk, egg
8 milk, fish

Total customers: 8
Customers who bought coffee: 5
Customers who bought both coffee and sugar: 3
Support: 3/8 = 37.5%
Confidence: 3/5 = 60%
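To make the support and confidence arithmetic concrete, here is a minimal Python sketch (the transaction data is taken from the example above; the function names are illustrative, not from the slides):

# Minimal sketch: support and confidence of a rule A => B
# over the example transaction set above.

transactions = [
    {"coffee", "bread"},
    {"coffee", "meat", "apple"},
    {"coffee", "sugar", "noodle", "salt"},
    {"coffee", "sugar", "orange", "potato"},
    {"coffee", "sugar", "tomato"},
    {"bread", "sugar", "bean"},
    {"milk", "egg"},
    {"milk", "fish"},
]

def support(itemset, transactions):
    # Fraction of transactions containing every item in `itemset`.
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    # support(A ∪ B) / support(A) for the rule A => B.
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

print(support({"coffee", "sugar"}, transactions))        # 3/8 = 0.375
print(confidence({"coffee"}, {"sugar"}, transactions))   # 3/5 = 0.6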
Formulating Association Rules (cont.)
In terms of probability
Let X = (X1, X2) be defined as follows: for a random customer c, X1 = 1 if c buys coffee and 0 otherwise; X2 = 1 if c buys sugar and 0 otherwise
coffee => sugar [support = 10%, confidence = 70%] is interpreted as:
p(X1 = 1, X2 = 1) = 10% and p(X2 = 1 | X1 = 1) = 70%
or simply
p(coffee, sugar) = 10% and p(sugar | coffee) = 70%
Formulating Association Rules (cont.)
Concepts
I = {i1, …, im} is a set of items
D = {T1, …, Tn} is a set where, for all i, Ti ⊆ I (Ti is called a transaction; D is referred to as a transaction database)
An association rule is an implication A => B, where A, B ⊆ I and A ∩ B = ∅
A => B holds in D with support s and confidence r if
  |{T : A ∪ B ⊆ T and T ∈ D}| / |D| = s  and  |{T : A ∪ B ⊆ T and T ∈ D}| / |{T : A ⊆ T and T ∈ D}| = r
If we view any U ⊆ I as the event that a randomly selected transaction from D contains U, then p(A ∪ B) = s and p(B | A) = r
Formulating Association Rules (cont.)
(Recall: I = {i1, …, im}, D = {T1, …, Tn}, A ⊆ I, B ⊆ I, A ∩ B = ∅)
Association rule A => B is valid with respect to the support threshold r and confidence threshold c if A => B holds with a support s ≥ r and a confidence f ≥ c
Additional concepts
k-itemset: any subset of I that contains exactly k items
Occurrence frequency of itemset t, denoted frequency(t): the number of transactions in D that contain t (also called the support count)
Itemset t is frequent with respect to support threshold r if frequency(t)/|D| ≥ r
Implication: A ∪ B being frequent with respect to r is a necessary condition for A => B to be valid
Formulating Association Rules – Example
Let I = {apple, bread, bean, coffee, egg, fish, milk, meat, noodle, orange, potato, salt, sugar, tomato}
Let D be the transaction set below:
1 coffee, bread
2 coffee, meat, apple
3 coffee, sugar, noodle, salt
4 coffee, sugar, orange, potato
5 coffee, sugar, tomato
6 bread, sugar, bean
7 milk, egg
8 milk, fish
Let the support threshold be 30% and the confidence threshold be 60%
Consider the association rule {coffee} => {sugar}
The occurrence frequency of {coffee, sugar} is 3
{coffee, sugar} is a frequent 2-itemset, since 3/8 ≥ 30%
The occurrence frequency of {coffee} is 5
The confidence of {coffee} => {sugar} is 3/5 ≥ 60%
So {coffee} => {sugar} is a valid association rule w.r.t. the given support and confidence thresholds
Formulating Association Rules – Example (cont.)
Using the same I, D, and thresholds as above
Consider the association rule {milk} => {egg}
The occurrence frequency of {milk, egg} is 1
{milk, egg} is not a frequent 2-itemset, since 1/8 < 30%
So {milk} => {egg} is not a valid association rule w.r.t. the given thresholds
Mining Association Rules
Goal: discover all the valid association rules with respect to the given support threshold r and confidence threshold c
Steps:
1. Find all frequent itemsets w.r.t. r
2. Generate association rules from the frequent itemsets w.r.t. c
Approaches to frequent itemset search
Naive approach:
  scan the whole itemset space
  for each itemset, count its frequency (by scanning all the transactions) and compare it with r
  high cost – the number of itemsets is huge (2^m − 1 non-empty subsets of I)
A naive approach for finding all frequent itemsets??
[Figure: the itemset lattice over items {A, B, C, D, E} – the empty set at the top, the 1-itemsets A–E, then all 2-, 3-, and 4-itemsets, down to ABCDE – illustrating that the naive approach must examine every node of the lattice.]
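As a baseline, here is a brute-force sketch of the naive approach (the function name is illustrative): it enumerates every non-empty subset of I and counts each one by scanning all transactions, which is exponential in the number of items.

from itertools import combinations

def naive_frequent_itemsets(transactions, items, min_support):
    # Enumerate every non-empty subset of `items` (2^m - 1 of them) and
    # keep those whose relative frequency reaches `min_support`.
    n = len(transactions)
    frequent = {}
    for k in range(1, len(items) + 1):
        for subset in combinations(sorted(items), k):
            count = sum(1 for t in transactions if set(subset) <= t)
            if count / n >= min_support:
                frequent[subset] = count
    return frequent

# Example: on the coffee/sugar transactions with a 30% threshold,
# ("coffee", "sugar") comes back with count 3.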
Apriori Algorithm for AR Mining
Apriori property
Let t1 and t2 be any itemsets with t2 ⊆ t1. Then
  t1 is frequent => t2 is frequent
  or equivalently, t2 is not frequent => t1 is not frequent
So if we know that an itemset is not frequent, there is no need to check its supersets
Based on the second (contrapositive) form, we can prune the search space
After pruning, the remaining itemsets are called candidate itemsets
For each candidate itemset, we count the transactions that contain it to determine whether it is frequent
Illustrating the Apriori principle
[Figure: the same itemset lattice over {A, B, C, D, E}; one itemset (e.g., AB) is found to be not frequent, so all of its supersets are pruned from the search space.]
Apriori Algorithm (cont.)
Assumes the items are ordered within every itemset as well as within every transaction
Works in ascending order of itemset size k:
1. Find all the frequent 1-itemsets (by counting)
2. Join (i.e., union) each qualifying pair of frequent 1-itemsets into a 2-itemset
3. Join each qualifying pair of frequent (k-1)-itemsets into a k-itemset
4. From these, generate the candidate k-itemsets
5. Get the transaction count for each candidate k-itemset and then collect the frequent ones
6. Repeat this process until the candidate set becomes empty
Issues
How to join (step 3)?
How to generate candidates (step 4)?
Apriori Algorithm (cont.)
Let U and V be a pair of frequent (k-1)-itemsets; we join them as follows:
Condition: they share the first k-2 items
Keep these k-2 items, then add the two remaining items, one from each set
Example:
  join {1,4,5,7} and {1,4,5,9}: ok, get {1,4,5,7,9}
  join {1,4,5,7} and {1,2,4,8}: no
  join {1,4,5,7} and {4,5,7,9}: no
Let W be the resulting set after joining U and V
  discard W if one of its (k-1)-subitemsets is not frequent (this is where the Apriori property is applied)
  all the k-itemsets that have not been discarded constitute the candidate k-itemsets
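A minimal sketch of this join-and-prune step in Python (itemsets represented as sorted tuples; the function name is illustrative):

from itertools import combinations

def generate_candidates(frequent_prev):
    # frequent_prev: the set of frequent (k-1)-itemsets, each a sorted tuple.
    candidates = set()
    for u in frequent_prev:
        for v in frequent_prev:
            # Join condition: same first k-2 items, different last item.
            if u[:-1] == v[:-1] and u[-1] < v[-1]:
                w = u + (v[-1],)                       # the joined k-itemset
                # Apriori pruning: every (k-1)-subset of w must be frequent.
                if all(s in frequent_prev
                       for s in combinations(w, len(w) - 1)):
                    candidates.add(w)
    return candidates

# Example (the frequent 2-itemsets from the worked example below):
print(generate_candidates({(1, 2), (1, 4), (2, 4), (2, 5)}))   # {(1, 2, 4)}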
Apriori Algorithm – an Example
I = {1,2,3,4,5}
D = { {1,2,3,4}, {1,2,4}, {2,4,5}, {1,2,5}, {2,4} }
Support threshold: 40% (min support count: 2)
Steps
1. 1-itemsets: {1}, {2}, {3}, {4}, {5}
2. Frequent 1-itemsets: {1}, {2}, {4}, {5}
3. Join frequent 1-itemsets: {1,2}, {1,4}, {1,5}, {2,4}, {2,5}, {4,5}
4. Candidate 2-itemsets: {1,2}, {1,4}, {1,5}, {2,4}, {2,5}, {4,5}
5. Frequent 2-itemsets: {1,2}, {1,4}, {2,4}, {2,5}
6. Join frequent 2-itemsets: {1,2,4}, {2,4,5}
7. Candidate 3-itemsets: {1,2,4} ({2,4,5} is pruned because {4,5} is not frequent)
8. Frequent 3-itemsets: {1,2,4}
9. Join frequent 3-itemsets: none
10. Candidate 4-itemsets: none
11. Stop
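Putting the pieces together, here is a compact, self-contained Apriori sketch run on this example (names are illustrative; it reproduces the frequent itemsets listed above):

from itertools import combinations

def apriori(transactions, min_count):
    items = sorted({i for t in transactions for i in t})
    # Step 1: frequent 1-itemsets (as sorted tuples).
    level = {(i,) for i in items
             if sum(1 for t in transactions if i in t) >= min_count}
    all_frequent = set(level)
    while level:
        # Join + prune to obtain the candidate k-itemsets.
        candidates = set()
        for u in level:
            for v in level:
                if u[:-1] == v[:-1] and u[-1] < v[-1]:
                    w = u + (v[-1],)
                    if all(s in level for s in combinations(w, len(w) - 1)):
                        candidates.add(w)
        # Count each candidate and keep the frequent ones.
        level = {c for c in candidates
                 if sum(1 for t in transactions if set(c) <= t) >= min_count}
        all_frequent |= level
    return all_frequent

D = [{1, 2, 3, 4}, {1, 2, 4}, {2, 4, 5}, {1, 2, 5}, {2, 4}]
print(sorted(apriori(D, 2)))
# [(1,), (1, 2), (1, 2, 4), (1, 4), (2,), (2, 4), (2, 5), (4,), (5,)]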
Correctness
Does the Apriori algorithm find all frequent itemsets?
i.e., do the candidate k-itemsets include all the frequent k-itemsets?
We require two (k-1)-itemsets U and V to share the first k-2 items in order to be joined. Does this condition jeopardize correctness?
Suppose U and V do not share the first k-2 items, and let W = U ∪ V be a k-itemset. W will not be generated from joining U and V.
Case 1: W is not frequent – not a problem.
Case 2: W is frequent – can we conclude that its frequent status will not be discovered? (No: if W is frequent, all of its (k-1)-subitemsets are frequent, and the two obtained by removing the last and the second-to-last item of W do share their first k-2 items, so joining them still generates W.)
Generating Association Rules
Let S be any frequent itemset
For each non-empty proper subset a ⊂ S, calculate freq(S) / freq(a)
If this value is not smaller than the confidence threshold, then output the following association rule:
  a => S − a
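A minimal sketch of this rule-generation step (it assumes a dictionary freq mapping each frequent itemset, as a frozenset, to its support count, e.g. collected while running Apriori; by the Apriori property every proper subset of a frequent itemset is itself frequent, so its count is available):

from itertools import combinations

def generate_rules(S, freq, min_conf):
    # Emit a => S - a for every non-empty proper subset a of the frequent
    # itemset S whose confidence freq(S) / freq(a) reaches min_conf.
    S = frozenset(S)
    rules = []
    for k in range(1, len(S)):
        for a in combinations(sorted(S), k):
            a = frozenset(a)
            conf = freq[S] / freq[a]          # = p(S - a | a)
            if conf >= min_conf:
                rules.append((set(a), set(S - a), conf))
    return rules

# Example with support counts from the coffee/sugar data:
freq = {frozenset({"coffee"}): 5, frozenset({"sugar"}): 4,
        frozenset({"coffee", "sugar"}): 3}
print(generate_rules({"coffee", "sugar"}, freq, 0.6))
# [({'coffee'}, {'sugar'}, 0.6), ({'sugar'}, {'coffee'}, 0.75)]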
Pattern Evaluation
The support and confidence framework can only help exclude uninteresting rules
But it does not necessarily guarantee that the rules generated are interesting
How to make a judgement?
❖ Mostly determined subjectively by the users
❖ May differ from user to user
❖ Some objective measures may be used in limited contexts
Interestingness Measure: Correlations (Lift)
play basketball => eat cereal [40%, 66.7%] is misleading when the overall percentage of students eating cereal is 75% > 66.7%
play basketball => not eat cereal [20%, 33.3%] is more accurate, although it has lower support and confidence
Measure of dependent/correlated events: lift (larger -> higher correlation)
lift = P(U, V) / (P(U) P(V))

             Basketball   Not basketball   Sum (row)
Cereal       2000         1750             3750
Not cereal   1000          250             1250
Sum (col.)   3000         2000             5000

lift(B, C)  = (2000/5000) / ((3000/5000) × (3750/5000)) ≈ 0.89
lift(B, ¬C) = (1000/5000) / ((3000/5000) × (1250/5000)) ≈ 1.33
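The same computation as a short sketch (counts are taken from the table above; the variable names are illustrative):

n = 5000
n_basketball, n_cereal, n_not_cereal = 3000, 3750, 1250
n_b_and_c, n_b_and_not_c = 2000, 1000

def lift(n_uv, n_u, n_v, n):
    # Estimate P(U, V) / (P(U) * P(V)) from counts.
    return (n_uv / n) / ((n_u / n) * (n_v / n))

print(lift(n_b_and_c, n_basketball, n_cereal, n))          # ~0.89
print(lift(n_b_and_not_c, n_basketball, n_not_cereal, n))  # ~1.33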
χ² Correlation Test for A and B
Notation:
❖ n: total number of transactions
❖ Dom(A) = {a1, …, ac}
❖ Dom(B) = {b1, …, br}
❖ (Ai, Bj): the joint event that A = ai and B = bj

χ² = Σ_{i=1..c} Σ_{j=1..r} (a_ij − e_ij)² / e_ij

where
  a_ij: observed frequency of the event (Ai, Bj)
  e_ij = count(A = ai) × count(B = bj) / n : expected frequency of (Ai, Bj)
  count(A = ai): number of tuples with A = ai
  count(B = bj): number of tuples with B = bj
Common practice: A and B are considered correlated if the p-value of χ² with (c−1)(r−1) degrees of freedom is smaller than 0.05
• Let B and C be two random variables with
  • Dom(B) = {Basketball, Not-basketball}
  • Dom(C) = {Cereal, Not-cereal}
• The contingency table (expected frequencies in parentheses):

             Basketball    Not-basketball   Sum (row)
Cereal       2000 (2250)   1750 (1500)      3750
Not-cereal   1000 (750)     250 (500)       1250
Sum (col.)   3000           2000            5000

• χ² = (2000−2250)²/2250 + (1750−1500)²/1500 + (1000−750)²/750 + (250−500)²/500 ≈ 277.78
• The p-value of 277.78 with one degree of freedom is far below 0.05
• So B and C are strongly correlated
• Observing the data, they are negatively correlated (fewer basketball players eat cereal than expected under independence)
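The same test can be reproduced with scipy (an added dependency, not part of the slides); correction=False disables Yates' continuity correction so the plain formula above is applied:

import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[2000, 1750],    # cereal:     basketball, not basketball
                     [1000,  250]])   # not cereal: basketball, not basketball

chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(chi2)             # ~277.78
print(dof)              # 1 = (c-1)(r-1)
print(expected)         # [[2250. 1500.] [ 750.  500.]]
print(p_value < 0.05)   # True -> B and C are correlated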
Multi-level AR
Association rules may involve concepts at different abstraction levels
Multi-level AR
In some cases, it is difficult to find interesting patterns at very low abstraction levels
It may be easier to find strong associations between more general concepts
❖ Example:
  ❖ laptop => printer may be a strong rule
  ❖ Dell XPS 16 Notebook => Canon 7420 may not be

TID    Items purchased
T100   Apple 17 Pro Notebook, HP Photosmart Pro b9180, Canon 7420 Printer
T200   Microsoft Office Pro 2010, Microsoft Wireless Optical Mouse 5000
T300   Logitech VX Namo Cordless Laser Mouse, Fellowes CEL Wrist Rest
T400   Dell Studio XPS 16 Notebook, Canon PowerShot SD1400
T500   Lenovo ThinkPad X200 Tablet PC, Symantec Norton Antivirus 2010
…
Multi-level AR
Multi-level AR can be mined efficiently using the support-confidence framework
Either a top-down or a bottom-up approach can be used
Counts are accumulated toward frequent itemsets at each level
For each level, any AR algorithm can be used (a minimal sketch of generalizing transactions to a higher level follows below)
We can also use a cross-level Apriori property
❖ Cross-level Apriori property: the count of any itemset is not higher than that of its parent, so the parent of a frequent itemset is also frequent
❖ Example: frequency(Desktop, Office) ≤ frequency(Computer, Software)
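A minimal sketch of the generalization step for top-down multi-level mining (the taxonomy mapping is illustrative, drawn loosely from the product table above): each item is replaced by its parent concept, and an ordinary frequent-itemset algorithm is then run on the generalized transactions.

taxonomy = {   # item -> parent concept (illustrative)
    "Dell Studio XPS 16 Notebook": "laptop",
    "Lenovo ThinkPad X200 Tablet PC": "laptop",
    "Canon 7420 Printer": "printer",
    "HP Photosmart Pro b9180": "printer",
}

def generalize(transactions, taxonomy):
    # Items without a parent in the taxonomy are kept unchanged.
    return [{taxonomy.get(item, item) for item in t} for t in transactions]

transactions = [
    {"Dell Studio XPS 16 Notebook", "Canon 7420 Printer"},
    {"Lenovo ThinkPad X200 Tablet PC", "HP Photosmart Pro b9180"},
]
print(generalize(transactions, taxonomy))
# [{'laptop', 'printer'}, {'laptop', 'printer'}]  -- laptop => printer can now be frequent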
Multi-level AR
Variation 1: uniform minimum support for all levels
❖ Pros: simplicity
❖ Cons: lower-level concepts are unlikely to occur with the same frequency as higher-level concepts
Multi-level AR
Variation 2: reduced minimum support at lower levels
❖ Pros: higher flexibility
❖ Cons: increased complexity in the mining process
❖ Note: the Apriori property may not always hold across levels
Variation 3: group-based support
❖ Domain experts have insight into the specificities of individual items
❖ Setting different supports for different groups may be more realistic
❖ For example, you may set a low support threshold for expensive items