Mining Association Rules
Motivation
Discovering relations among transactional data
Example – market basket analysis
Discovery of the buying habits of customers: what items are frequently purchased by a customer in a single trip?
Helps develop marketing strategies
Issues:
How to formulate association rules
How to determine interesting association rules
How to discover interesting association rules efficiently in a large data set?
Formulating Association Rules
Example: "a customer that purchases coffee tends to also buy sugar" is represented as:
coffee => sugar [support = 10%, confidence = 70%]
support = 10%: 10% of all customers purchase both coffee and sugar
confidence = 70%: 70% of the customers who buy coffee also buy sugar
Thresholds: support must be at least r, confidence at least c
Users set the thresholds to indicate interestingness

Example transaction set:
1 coffee, bread
2 coffee, meat, apple
3 coffee, sugar, noodle, salt
4 coffee, sugar, orange, potato
5 coffee, sugar, tomato
6 bread, sugar, bean
7 milk, egg
8 milk, fish

Total customers: 8
Customers who bought coffee: 5
Customers who bought both coffee and sugar: 3
Support: 3/8 = 37.5%
Confidence: 3/5 = 60%
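To make the support and confidence arithmetic concrete, here is a minimal Python sketch (the transaction data is taken from the example above; the function names are illustrative, not from the slides):

# Minimal sketch: support and confidence of a rule A => B
# over the example transaction set above.

transactions = [
    {"coffee", "bread"},
    {"coffee", "meat", "apple"},
    {"coffee", "sugar", "noodle", "salt"},
    {"coffee", "sugar", "orange", "potato"},
    {"coffee", "sugar", "tomato"},
    {"bread", "sugar", "bean"},
    {"milk", "egg"},
    {"milk", "fish"},
]

def support(itemset, transactions):
    # Fraction of transactions containing every item in `itemset`.
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    # support(A ∪ B) / support(A) for the rule A => B.
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

print(support({"coffee", "sugar"}, transactions))        # 3/8 = 0.375
print(confidence({"coffee"}, {"sugar"}, transactions))   # 3/5 = 0.6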
Formulating Association Rules (cont.)
In terms of probability
Let X = (X1, X2) be defined as follows: for a random customer c, X1 = 1 if c buys coffee and 0 otherwise; X2 = 1 if c buys sugar and 0 otherwise
coffee => sugar [support = 10%, confidence = 70%] is interpreted as:
p(X1 = 1, X2 = 1) = 10% and p(X2 = 1 | X1 = 1) = 70%
or simply
p(coffee, sugar) = 10% and p(sugar | coffee) = 70%
Formulating Association Rules (cont.)
Concepts
I = {i1, …, im} is a set of items
D = {T1, …, Tn} is a set where, for all i, Ti ⊆ I (Ti is called a transaction; D is referred to as a transaction database)
An association rule is an implication A => B, where A, B ⊆ I and A ∩ B = ∅
A => B holds in D with support s and confidence r if
  |{T : A ∪ B ⊆ T and T ∈ D}| / |D| = s  and  |{T : A ∪ B ⊆ T and T ∈ D}| / |{T : A ⊆ T and T ∈ D}| = r
If we view any U ⊆ I as the event that a randomly selected transaction from D contains U, then p(A ∪ B) = s and p(B | A) = r
Formulating Association Rules (cont.)
(Recall: I = {i1, …, im}, D = {T1, …, Tn}, A ⊆ I, B ⊆ I, A ∩ B = ∅)
Association rule A => B is valid with respect to the support threshold r and confidence threshold c if A => B holds with a support s ≥ r and a confidence f ≥ c
Additional concepts
k-itemset: any subset of I that contains exactly k items
Occurrence frequency of itemset t, denoted frequency(t): the number of transactions in D that contain t (also called the support count)
Itemset t is frequent with respect to support threshold r if frequency(t)/|D| ≥ r
Implication: A ∪ B being frequent with respect to r is a necessary condition for A => B to be valid
Formulating Association Rules – Example
Let I = {apple, bread, bean, coffee, egg, fish, milk, meat, noodle, orange, potato, salt, sugar, tomato}
Let D be the transaction set below:
1 coffee, bread
2 coffee, meat, apple
3 coffee, sugar, noodle, salt
4 coffee, sugar, orange, potato
5 coffee, sugar, tomato
6 bread, sugar, bean
7 milk, egg
8 milk, fish
Let the support threshold be 30% and the confidence threshold be 60%
Consider the association rule {coffee} => {sugar}
The occurrence frequency of {coffee, sugar} is 3
{coffee, sugar} is a frequent 2-itemset, since 3/8 ≥ 30%
The occurrence frequency of {coffee} is 5
The confidence of {coffee} => {sugar} is 3/5 ≥ 60%
So {coffee} => {sugar} is a valid association rule w.r.t. the given support and confidence thresholds
Formulating Association Rules – Example (cont.)
Using the same I, D, and thresholds as above
Consider the association rule {milk} => {egg}
The occurrence frequency of {milk, egg} is 1
{milk, egg} is not a frequent 2-itemset, since 1/8 < 30%
So {milk} => {egg} is not a valid association rule w.r.t. the given thresholds
Mining Association Rules
Goal: discover all the valid association rules with respect to the given support threshold r and confidence threshold c
Steps:
1. Find all frequent itemsets w.r.t. r
2. Generate association rules from the frequent itemsets w.r.t. c
Approaches to frequent itemset search
Naive approach:
  scan the whole itemset space
  for each itemset, count its frequency (by scanning all the transactions) and compare it with r
  high cost – the number of itemsets is huge (2^m − 1 non-empty subsets of I)
A naive approach for finding all frequent itemsets??
[Figure: the itemset lattice over items {A, B, C, D, E} – the empty set at the top, the 1-itemsets A–E, then all 2-, 3-, and 4-itemsets, down to ABCDE – illustrating that the naive approach must examine every node of the lattice.]
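As a baseline, here is a brute-force sketch of the naive approach (the function name is illustrative): it enumerates every non-empty subset of I and counts each one by scanning all transactions, which is exponential in the number of items.

from itertools import combinations

def naive_frequent_itemsets(transactions, items, min_support):
    # Enumerate every non-empty subset of `items` (2^m - 1 of them) and
    # keep those whose relative frequency reaches `min_support`.
    n = len(transactions)
    frequent = {}
    for k in range(1, len(items) + 1):
        for subset in combinations(sorted(items), k):
            count = sum(1 for t in transactions if set(subset) <= t)
            if count / n >= min_support:
                frequent[subset] = count
    return frequent

# Example: on the coffee/sugar transactions with a 30% threshold,
# ("coffee", "sugar") comes back with count 3.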
Apriori Algorithm for AR Mining
Apriori property
Let t1 and t2 be any itemsets with t2 ⊆ t1. Then
  t1 is frequent => t2 is frequent
  or equivalently, t2 is not frequent => t1 is not frequent
So if we know that an itemset is not frequent, there is no need to check its supersets
Based on the second (contrapositive) form, we can prune the search space
After pruning, the remaining itemsets are called candidate itemsets
For each candidate itemset, we count the transactions that contain it to determine whether it is frequent
Illustrating the Apriori principle
[Figure: the same itemset lattice over {A, B, C, D, E}; one itemset (e.g., AB) is found to be not frequent, so all of its supersets are pruned from the search space.]
Apriori Algorithm (cont.)
Assumes the items are ordered within every itemset as well as within every transaction
Works in ascending order of itemset size k:
1. Find all the frequent 1-itemsets (by counting)
2. Join (i.e., union) each qualifying pair of frequent 1-itemsets into a 2-itemset
3. Join each qualifying pair of frequent (k-1)-itemsets into a k-itemset
4. From these, generate the candidate k-itemsets
5. Get the transaction count for each candidate k-itemset and then collect the frequent ones
6. Repeat this process until the candidate set becomes empty
Issues
How to join (step 3)?
How to generate candidates (step 4)?
Apriori Algorithm (cont.)
Let U and V be a pair of frequent (k-1)-itemsets; we join them as follows:
Condition: they share the first k-2 items
Keep these k-2 items, then add the two remaining items, one from each set
Example:
  join {1,4,5,7} and {1,4,5,9}: ok, get {1,4,5,7,9}
  join {1,4,5,7} and {1,2,4,8}: no
  join {1,4,5,7} and {4,5,7,9}: no
Let W be the resulting set after joining U and V
  discard W if one of its (k-1)-subitemsets is not frequent (this is where the Apriori property is applied)
  all the k-itemsets that have not been discarded constitute the candidate k-itemsets
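A minimal sketch of this join-and-prune step in Python (itemsets represented as sorted tuples; the function name is illustrative):

from itertools import combinations

def generate_candidates(frequent_prev):
    # frequent_prev: the set of frequent (k-1)-itemsets, each a sorted tuple.
    candidates = set()
    for u in frequent_prev:
        for v in frequent_prev:
            # Join condition: same first k-2 items, different last item.
            if u[:-1] == v[:-1] and u[-1] < v[-1]:
                w = u + (v[-1],)                       # the joined k-itemset
                # Apriori pruning: every (k-1)-subset of w must be frequent.
                if all(s in frequent_prev
                       for s in combinations(w, len(w) - 1)):
                    candidates.add(w)
    return candidates

# Example (the frequent 2-itemsets from the worked example below):
print(generate_candidates({(1, 2), (1, 4), (2, 4), (2, 5)}))   # {(1, 2, 4)}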
Apriori Algorithm – an Example
I = {1,2,3,4,5}
D = { {1,2,3,4}, {1,2,4}, {2,4,5}, {1,2,5}, {2,4} }
Support threshold: 40% (min support count: 2)
Steps
1. 1-itemsets: {1}, {2}, {3}, {4}, {5}
2. Frequent 1-itemsets: {1}, {2}, {4}, {5}
3. Join frequent 1-itemsets: {1,2}, {1,4}, {1,5}, {2,4}, {2,5}, {4,5}
4. Candidate 2-itemsets: {1,2}, {1,4}, {1,5}, {2,4}, {2,5}, {4,5}
5. Frequent 2-itemsets: {1,2}, {1,4}, {2,4}, {2,5}
6. Join frequent 2-itemsets: {1,2,4}, {2,4,5}
7. Candidate 3-itemsets: {1,2,4} ({2,4,5} is pruned because {4,5} is not frequent)
8. Frequent 3-itemsets: {1,2,4}
9. Join frequent 3-itemsets: none
10. Candidate 4-itemsets: none
11. Stop
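Putting the pieces together, here is a compact, self-contained Apriori sketch run on this example (names are illustrative; it reproduces the frequent itemsets listed above):

from itertools import combinations

def apriori(transactions, min_count):
    items = sorted({i for t in transactions for i in t})
    # Step 1: frequent 1-itemsets (as sorted tuples).
    level = {(i,) for i in items
             if sum(1 for t in transactions if i in t) >= min_count}
    all_frequent = set(level)
    while level:
        # Join + prune to obtain the candidate k-itemsets.
        candidates = set()
        for u in level:
            for v in level:
                if u[:-1] == v[:-1] and u[-1] < v[-1]:
                    w = u + (v[-1],)
                    if all(s in level for s in combinations(w, len(w) - 1)):
                        candidates.add(w)
        # Count each candidate and keep the frequent ones.
        level = {c for c in candidates
                 if sum(1 for t in transactions if set(c) <= t) >= min_count}
        all_frequent |= level
    return all_frequent

D = [{1, 2, 3, 4}, {1, 2, 4}, {2, 4, 5}, {1, 2, 5}, {2, 4}]
print(sorted(apriori(D, 2)))
# [(1,), (1, 2), (1, 2, 4), (1, 4), (2,), (2, 4), (2, 5), (4,), (5,)]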
Correctness
Does the Apriori algorithm find all frequent itemsets?
i.e., do the candidate k-itemsets include all the frequent k-itemsets?
We require two (k-1)-itemsets U and V to share the first k-2 items in order to be joined. Does this condition jeopardize correctness?
Suppose U and V do not share the first k-2 items, and let W = U ∪ V be a k-itemset. W will not be generated from joining U and V.
Case 1: W is not frequent – not a problem.
Case 2: W is frequent – can we conclude that its frequent status will not be discovered? (No: if W is frequent, all of its (k-1)-subitemsets are frequent, and the two obtained by removing the last and the second-to-last item of W do share their first k-2 items, so joining them still generates W.)
Generating Association Rules
Let S be any frequent itemset
For each non-empty proper subset a ⊂ S, calculate freq(S) / freq(a)
If this value is not smaller than the confidence threshold, then output the following association rule:
  a => S − a
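A minimal sketch of this rule-generation step (it assumes a dictionary freq mapping each frequent itemset, as a frozenset, to its support count, e.g. collected while running Apriori; by the Apriori property every proper subset of a frequent itemset is itself frequent, so its count is available):

from itertools import combinations

def generate_rules(S, freq, min_conf):
    # Emit a => S - a for every non-empty proper subset a of the frequent
    # itemset S whose confidence freq(S) / freq(a) reaches min_conf.
    S = frozenset(S)
    rules = []
    for k in range(1, len(S)):
        for a in combinations(sorted(S), k):
            a = frozenset(a)
            conf = freq[S] / freq[a]          # = p(S - a | a)
            if conf >= min_conf:
                rules.append((set(a), set(S - a), conf))
    return rules

# Example with support counts from the coffee/sugar data:
freq = {frozenset({"coffee"}): 5, frozenset({"sugar"}): 4,
        frozenset({"coffee", "sugar"}): 3}
print(generate_rules({"coffee", "sugar"}, freq, 0.6))
# [({'coffee'}, {'sugar'}, 0.6), ({'sugar'}, {'coffee'}, 0.75)]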
Pattern Evaluation
The support and confidence framework can only help exclude uninteresting rules
But it does not necessarily guarantee that the rules generated are interesting
How to make a judgement?
❖ Mostly determined subjectively by the users
❖ May differ from user to user
❖ Some objective measures may be used in limited contexts
Interestingness Measure: Correlations (Lift)
play basketball => eat cereal [40%, 66.7%] is misleading when the overall percentage of students eating cereal is 75% > 66.7%
play basketball => not eat cereal [20%, 33.3%] is more accurate, although it has lower support and confidence
Measure of dependent/correlated events: lift (larger -> higher correlation)
lift = P(U, V) / (P(U) P(V))

             Basketball   Not basketball   Sum (row)
Cereal       2000         1750             3750
Not cereal   1000          250             1250
Sum (col.)   3000         2000             5000

lift(B, C)  = (2000/5000) / ((3000/5000) × (3750/5000)) ≈ 0.89
lift(B, ¬C) = (1000/5000) / ((3000/5000) × (1250/5000)) ≈ 1.33
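The same computation as a short sketch (counts are taken from the table above; the variable names are illustrative):

n = 5000
n_basketball, n_cereal, n_not_cereal = 3000, 3750, 1250
n_b_and_c, n_b_and_not_c = 2000, 1000

def lift(n_uv, n_u, n_v, n):
    # Estimate P(U, V) / (P(U) * P(V)) from counts.
    return (n_uv / n) / ((n_u / n) * (n_v / n))

print(lift(n_b_and_c, n_basketball, n_cereal, n))          # ~0.89
print(lift(n_b_and_not_c, n_basketball, n_not_cereal, n))  # ~1.33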
χ² Correlation Test for A and B
Notation:
❖ n: total number of transactions
❖ Dom(A) = {a1, …, ac}
❖ Dom(B) = {b1, …, br}
❖ (Ai, Bj): the joint event that A = ai and B = bj

χ² = Σ_{i=1..c} Σ_{j=1..r} (a_ij − e_ij)² / e_ij

where
  a_ij: observed frequency of the event (Ai, Bj)
  e_ij = count(A = ai) × count(B = bj) / n : expected frequency of (Ai, Bj)
  count(A = ai): number of tuples with A = ai
  count(B = bj): number of tuples with B = bj
Common practice: A and B are considered correlated if the p-value of χ² with (c−1)(r−1) degrees of freedom is smaller than 0.05
• Let B and C be two random variables with
  • Dom(B) = {Basketball, Not-basketball}
  • Dom(C) = {Cereal, Not-cereal}
• The contingency table (expected frequencies in parentheses):

             Basketball    Not-basketball   Sum (row)
Cereal       2000 (2250)   1750 (1500)      3750
Not-cereal   1000 (750)     250 (500)       1250
Sum (col.)   3000           2000            5000

• χ² = (2000−2250)²/2250 + (1750−1500)²/1500 + (1000−750)²/750 + (250−500)²/500 ≈ 277.78
• The p-value of 277.78 with one degree of freedom is far below 0.05
• So B and C are strongly correlated
• Observing the data, they are negatively correlated (fewer basketball players eat cereal than expected under independence)
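The same test can be reproduced with scipy (an added dependency, not part of the slides); correction=False disables Yates' continuity correction so the plain formula above is applied:

import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[2000, 1750],    # cereal:     basketball, not basketball
                     [1000,  250]])   # not cereal: basketball, not basketball

chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(chi2)             # ~277.78
print(dof)              # 1 = (c-1)(r-1)
print(expected)         # [[2250. 1500.] [ 750.  500.]]
print(p_value < 0.05)   # True -> B and C are correlated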
Multi-level AR
Association rules may involve concepts at different abstraction levels
Multi-level AR
In some cases, it is difficult to find interesting patterns at very low abstraction levels
It may be easier to find strong associations between more general concepts
❖ Example:
  ❖ laptop => printer may be a strong rule
  ❖ Dell XPS 16 Notebook => Canon 7420 may not be

TID    Items purchased
T100   Apple 17 Pro Notebook, HP Photosmart Pro b9180, Canon 7420 Printer
T200   Microsoft Office Pro 2010, Microsoft Wireless Optical Mouse 5000
T300   Logitech VX Namo Cordless Laser Mouse, Fellowes CEL Wrist Rest
T400   Dell Studio XPS 16 Notebook, Canon PowerShot SD1400
T500   Lenovo ThinkPad X200 Tablet PC, Symantec Norton Antivirus 2010
…
Multi-level AR
Multi-level AR can be mined efficiently using the support-confidence framework
Either a top-down or a bottom-up approach can be used
Counts are accumulated toward frequent itemsets at each level
For each level, any AR algorithm can be used (a minimal sketch of generalizing transactions to a higher level follows below)
We can also use a cross-level Apriori property
❖ Cross-level Apriori property: the count of any itemset is not higher than that of its parent, so the parent of a frequent itemset is also frequent
❖ Example: frequency(Desktop, Office) ≤ frequency(Computer, Software)
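A minimal sketch of the generalization step for top-down multi-level mining (the taxonomy mapping is illustrative, drawn loosely from the product table above): each item is replaced by its parent concept, and an ordinary frequent-itemset algorithm is then run on the generalized transactions.

taxonomy = {   # item -> parent concept (illustrative)
    "Dell Studio XPS 16 Notebook": "laptop",
    "Lenovo ThinkPad X200 Tablet PC": "laptop",
    "Canon 7420 Printer": "printer",
    "HP Photosmart Pro b9180": "printer",
}

def generalize(transactions, taxonomy):
    # Items without a parent in the taxonomy are kept unchanged.
    return [{taxonomy.get(item, item) for item in t} for t in transactions]

transactions = [
    {"Dell Studio XPS 16 Notebook", "Canon 7420 Printer"},
    {"Lenovo ThinkPad X200 Tablet PC", "HP Photosmart Pro b9180"},
]
print(generalize(transactions, taxonomy))
# [{'laptop', 'printer'}, {'laptop', 'printer'}]  -- laptop => printer can now be frequent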
Multi-level AR
Variation 1: uniform minimum support for all levels
❖ Pros: simplicity
❖ Cons: lower-level concepts are unlikely to occur with the same frequency as higher-level concepts
Multi-level AR
Variation 2: reduced minimum support at lower levels
❖ Pros: higher flexibility
❖ Cons: increased complexity in the mining process
❖ Note: the Apriori property may not always hold across levels
Variation 3: group-based support
❖ Domain experts have insight into the specificities of individual items
❖ Setting different supports for different groups may be more realistic
❖ For example, you may set a low support threshold for expensive items