Module-IV (Frequent Pattern & Association Rule Mining)

Frequent Item Set & Association Rule Mining
Association rule mining
Given a set of transaction records, each containing some number of items from a given collection, the aim is to produce dependency rules that predict the occurrence of an item based on the occurrences of other items.
Transaction ID   Items bought
T1 (10)          Beer, Nuts, Diaper
T2 (20)          Beer, Coke, Diaper
T3 (30)          Bread, Diaper, Eggs
T4 (40)          Nuts, Eggs, Milk
T5 (50)          Nuts, Coffee, Diaper, Eggs, Milk

Example of Association Rules:
{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}

Implication ("→") means co-occurrence, not causality!
Customers who purchase Bread have a chance of also purchasing Milk.
Retail organizations try to find associations between products that can be sold together.
Note: Assume all data are categorical; there is no good algorithm for numeric data.
Example of Association rule
The next step is to determine the relationships and the rules. So, association rule mining is applied in this context.
It is a procedure which aims to observe frequently occurring patterns, correlations, or associations from datasets
found in various kinds of databases such as relational databases, transactional databases, and other forms of
repositories.
Market-Basket Analysis
• Market Basket Analysis (Association Analysis) is a mathematical modeling
technique based upon the theory that if you buy a certain group of items, you
are likely to buy another group of items.
Market-Basket Analysis
• Consider a shopping cart filled with several items
• Market basket analysis tries to answer the following questions:
• Who makes purchases?
• What items tend to be purchased together?
• obvious: steak and potatoes; beer and pretzels
• What items are purchased sequentially?
• obvious: house then furniture; car then tires
• In what order do customers purchase items?
• What items tend to be purchased by season?
• It is also about what customers do not purchase, and why.
• If customers purchase baking powder, but no flour, what are they baking?
• If customers purchase a mobile phone, but no case, are you missing an
opportunity?
Market-Basket Analysis
A database of customer transactions:
• Each transaction is a set of items
• Example: Transaction with TID 111 contains items {Pen, Ink, Milk, Juice}

TID   CID   Date     Item    Qty
111   201   5/1/99   Pen     2
111   201   5/1/99   Ink     1
111   201   5/1/99   Milk    3
111   201   5/1/99   Juice   6
112   105   6/3/99   Pen     1
112   105   6/3/99   Ink     1
112   105   6/3/99   Milk    1
113   106   6/5/99   Pen     1
113   106   6/5/99   Milk    1
114   201   7/1/99   Pen     2
114   201   7/1/99   Ink     2
114   201   7/1/99   Juice   4
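As a quick illustration (a minimal Python sketch, not part of the slides), the rows of this table can be grouped by TID so that each transaction becomes a set of items; dates and quantities are ignored for basket analysis:

rows = [
    (111, 201, "5/1/99", "Pen", 2), (111, 201, "5/1/99", "Ink", 1),
    (111, 201, "5/1/99", "Milk", 3), (111, 201, "5/1/99", "Juice", 6),
    (112, 105, "6/3/99", "Pen", 1), (112, 105, "6/3/99", "Ink", 1),
    (112, 105, "6/3/99", "Milk", 1),
    (113, 106, "6/5/99", "Pen", 1), (113, 106, "6/5/99", "Milk", 1),
    (114, 201, "7/1/99", "Pen", 2), (114, 201, "7/1/99", "Ink", 2),
    (114, 201, "7/1/99", "Juice", 4),
]

baskets = {}
for tid, cid, date, item, qty in rows:
    baskets.setdefault(tid, set()).add(item)   # one item set per transaction

print(baskets[111])   # {'Pen', 'Ink', 'Milk', 'Juice'}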
Market-Basket Analysis
Co-occurrences
• 80% of all customers purchase items X, Y and Z together.
Association rules
• 60% of all customers who purchase X and Y also buy Z.
Sequential patterns
• 60% of customers who first buy X also purchase Y within
three weeks.
Association rule mining
• Proposed by Agrawal et al. in 1993.
• It is an important data mining model studied extensively by the database
and data mining community.
• Assume all data are categorical.
• No good algorithm for numeric data.
• Initially used for Market Basket Analysis to find how items purchased by
customers are related.
Support: The number of transactions that include both items {A} and {B}, as a percentage of the total number of transactions. It measures how frequently the collection of items (here {A} and {B}) occurs together across all transactions, i.e., the probability that a transaction contains (A ∪ B).
Example: Support(milk) = 6/9, Support(cheese) = 7/9, Support(milk & cheese) = 6/9.
Confidence: The ratio of the number of transactions that include both {A} and {B} to the number of transactions that include all items in {A}. Confidence is the conditional probability that a transaction containing A also contains B: Pr(B|A).
Example: Confidence(milk => cheese) = (milk & cheese)/(milk) = 6/6 = 1.
Lift: The lift of the rule A => B is the confidence of the rule divided by the expected confidence, assuming that the itemsets A and B are independent of each other.
Example: Lift(milk => cheese) = [(milk & cheese)/(milk)] / [cheese/Total] = [6/6] / [7/9] = 9/7 ≈ 1.29.
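To make the three measures concrete, here is a minimal Python sketch. The nine transactions are a hypothetical reconstruction chosen only to reproduce the counts quoted above (milk in 6, cheese in 7, both in 6); the original milk/cheese table is not shown in this deck:

# Hypothetical data reproducing the slide's counts.
transactions = [
    {"milk", "cheese"}, {"milk", "cheese"}, {"milk", "cheese"},
    {"milk", "cheese"}, {"milk", "cheese"}, {"milk", "cheese"},
    {"cheese"}, {"bread"}, {"bread"},
]
N = len(transactions)

def support(itemset):
    # Fraction of transactions containing every item in `itemset`.
    return sum(itemset <= t for t in transactions) / N

sup = support({"milk", "cheese"})          # 6/9
conf = sup / support({"milk"})             # (6/9) / (6/9) = 1.0
lift = conf / support({"cheese"})          # 1 / (7/9) = 9/7 ≈ 1.29
print(sup, conf, lift)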
Class Exercise: Supermarket transaction data
Association Rule Mining Task
Association Rule Mining is a two-step approach:
Join Step: This step generates (k+1)-candidate itemsets from frequent k-itemsets by joining L_k with itself.
[The algorithm uses a level-wise search, where k-itemsets are used to explore (k+1)-itemsets. The frequent itemsets are extended one item at a time; this step is known as candidate generation.]
Prune Step: This step scans the database for the support count of each candidate. If a candidate itemset does not meet the minimum support, it is regarded as infrequent and removed. This step reduces the size of the candidate itemsets.
[From the candidate list of k-itemsets, it extracts the frequent list of k-itemsets using the support count.]
Finally, find the association rules from the frequent itemsets by calculating the confidence value; rules that do not meet the minimum confidence are ignored. A code sketch of the two steps follows.
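The join and prune steps can be sketched in a few lines of Python. This is a minimal, unoptimized illustration of the level-wise search (real implementations use hash trees and other data structures); apriori() takes transactions as collections of items and a minimum support count:

from itertools import combinations

def apriori(transactions, min_count):
    # L1: frequent 1-itemsets.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    L = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(L)
    k = 2
    while L:
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        prev = list(L)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step (a): drop candidates having an infrequent (k-1)-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in L for s in combinations(c, k - 1))}
        # Prune step (b): scan the database, keep candidates meeting min_count.
        L = {}
        for c in candidates:
            cnt = sum(c <= set(t) for t in transactions)
            if cnt >= min_count:
                L[c] = cnt
        frequent.update(L)
        k += 1
    return frequent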
Apriori Algorithm Problem
For the following dataset, generate rules using the Apriori algorithm. Consider minimum support count = 2 (i.e., 22%) and minimum confidence = 50%.
Apriori Algorithm Problem
Step 1, Iteration 1: Find C1, which contains the support count of each itemset of length 1; then find L1 from C1, keeping the length-1 itemsets that meet the minimum support count.

C1:
Itemset   Support Count
A         6
B         7
C         6
D         2
E         1

L1 (E is removed, since its support count 1 < 2):
Itemset   Support Count
A         6
B         7
C         6
D         2
Apriori Algorithm Problem contd...
Step 1, Iteration 2: With the help of L1, find C2, which contains the support count of each itemset of length 2; then find L2 from C2, keeping the length-2 itemsets that meet the minimum support count.

C2:
Itemset   Support Count
{A, B}    4
{A, C}    4
{A, D}    1
{B, C}    4
{B, D}    2
{C, D}    0

L2 ({A, D} and {C, D} are removed):
Itemset   Support Count
{A, B}    4
{A, C}    4
{B, C}    4
{B, D}    2
Apriori Algorithm Problem contd...
Step 1, Iteration 3: With the help of L2, find C3, which contains the support count of each itemset of length 3; then find L3 from C3, keeping the length-3 itemsets that meet the minimum support count.
Joining L2 with itself yields the candidates {A, B, C} and {B, C, D}; {B, C, D} is pruned because its subset {C, D} is infrequent, so C3 = {{A, B, C}}, and {A, B, C} meets the minimum support, giving L3 = {{A, B, C}}.
Step 2: As the given minimum confidence threshold is 50%, the first three rules A^B → C, B^C → A, and A^C → B can be considered strong association rules for the given problem.
APRIORI ALGORITHM EXAMPLE

Class Exercise

Class Exercise Solution
Generating Association Rules
From frequent item-sets
• Procedure 1:
• Suppose we have the list of frequent item-sets
Generating Association Rules
From frequent item-sets
• Procedure 2:
• For every nonempty subset S of I, output the rule:
S → (I - S)
• if support_count(I) / support_count(S) >= min_conf,
where min_conf is the minimum confidence threshold
• Let us assume:
• the minimum confidence threshold is 60%
Association Rules with confidence
• R1 : 1,3 -> 5
– Confidence = sc{1,3,5}/sc{1,3} = 2/3 = 66.66% (R1 is selected)
• R2 : 1,5 -> 3
– Confidence = sc{1,5,3}/sc{1,5} = 2/2 = 100% (R2 is selected)
• R3 : 3,5 -> 1
– Confidence = sc{3,5,1}/sc{3,5} = 2/3 = 66.66% (R3 is selected)
• R4 : 1 -> 3,5
– Confidence = sc{1,3,5}/sc{1} = 2/3 = 66.66% (R4 is selected)
• R5 : 3 -> 1,5
– Confidence = sc{3,1,5}/sc{3} = 2/4 = 50% (R5 is REJECTED)
• R6 : 5 -> 1,3
– Confidence = sc{5,1,3}/sc{5} = 2/4 = 50% (R6 is REJECTED)
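The six checks above can be reproduced with a small Python sketch of Procedure 2, using the support counts from this example (sc{1}=3, sc{3}=4, sc{5}=4, sc{1,3}=3, sc{1,5}=2, sc{3,5}=3, sc{1,3,5}=2):

from itertools import combinations

sc = {
    frozenset({1}): 3, frozenset({3}): 4, frozenset({5}): 4,
    frozenset({1, 3}): 3, frozenset({1, 5}): 2, frozenset({3, 5}): 3,
    frozenset({1, 3, 5}): 2,
}

def rules_from_itemset(I, min_conf):
    # Emit every rule S -> (I - S) whose confidence meets min_conf.
    I = frozenset(I)
    for r in range(1, len(I)):                      # nonempty proper subsets S
        for S in map(frozenset, combinations(I, r)):
            conf = sc[I] / sc[S]
            if conf >= min_conf:
                yield set(S), set(I - S), conf

for lhs, rhs, conf in rules_from_itemset({1, 3, 5}, 0.60):
    print(lhs, "->", rhs, f"({conf:.0%})")          # prints R1-R4; R5, R6 fail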
How to efficiently generate rules?
• In general, confidence does not have an anti-monotone property:
c(ABC→D) can be larger or smaller than c(AB→D)
• But confidence of rules generated from the same item-set has an anti-monotone property
• e.g., L = {A,B,C,D}:
c(ABC→D) ≥ c(AB→CD) ≥ c(A→BCD)
Confidence is anti-monotone w.r.t. the number of items on the RHS of the rule. This holds because all these rules share the same numerator support_count(ABCD), while moving items from the left-hand side to the right-hand side shrinks the antecedent, whose support count can only grow.
Rule generation for Apriori Algorithm

[Figure: candidate rules generated from a frequent itemset; rules below the minimum confidence are pruned.]
Apriori Algorithm Flow
Apriori Algorithm Pseudo Code
Apriori Algorithm
Class Exercise
Q1: Find frequent itemsets and generate association rules for them. Illustrate with a step-by-step process. (See the code sketch after Q2.)

Minimum support = 2
Minimum confidence = 50%

Transaction   List of items
T1            I1, I2, I3
T2            I2, I3, I4
T3            I4, I5
T4            I1, I2, I4
T5            I1, I2, I3, I5
T6            I1, I2, I3, I4

Q2: Choose a minimum support and minimum confidence of your choice and find the frequent itemsets.
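For instance, the apriori() sketch given earlier could be applied to Q1's data with min_count = 2 (a usage illustration only; working the steps by hand is the point of the exercise):

Q1 = [{"I1", "I2", "I3"}, {"I2", "I3", "I4"}, {"I4", "I5"},
      {"I1", "I2", "I4"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3", "I4"}]
for itemset, count in apriori(Q1, 2).items():
    print(sorted(itemset), count)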
Advantages
Uses large itemset property
Easily parallelized
Easy to implement
Disadvantages
Assumes transaction database is memory resident.
Requires many database scans.
Frequent Itemset, Closed Itemset and Maximal Itemset
TID Itemset
1 {A, C, D}
2 {B, C, E}
3 {A, B, C, E}
4 {B, E}
5 {A, B, C, E}
Frequent Itemset, Closed Itemset and Maximal Itemset

Frequent Itemset:
For the following transaction database, with minimum support count 2 the frequent itemsets are listed below.
Total itemsets: 2^5 - 1 = 31
Frequent itemsets: 15
Infrequent itemsets: 16

TID   Itemset
1     {A, C, D}
2     {B, C, E}
3     {A, B, C, E}
4     {B, E}
5     {A, B, C, E}

Itemset   Support   Freq/In-freq      Itemset   Support   Freq/In-freq
A         3/5       Freq              ABC       2/5       Freq
B         4/5       Freq              ABD       0/5       In-freq
C         4/5       Freq              ABE       2/5       Freq
D         1/5       In-freq           ACD       1/5       In-freq
E         4/5       Freq              ACE       2/5       Freq
AB        2/5       Freq              ADE       0/5       In-freq
AC        3/5       Freq              BCD       0/5       In-freq
AD        1/5       In-freq           BCE       3/5       Freq
AE        2/5       Freq              BDE       0/5       In-freq
BC        3/5       Freq              CDE       0/5       In-freq
BD        0/5       In-freq           ABCD      0/5       In-freq
BE        4/5       Freq              ABCE      2/5       Freq
CD        1/5       In-freq           ABDE      0/5       In-freq
CE        3/5       Freq              ACDE      0/5       In-freq
DE        0/5       In-freq           BCDE      0/5       In-freq
                                      ABCDE     0/5       In-freq
Frequent Itemset, Closed Itemset and Maximal Itemset

Apriori Principle:
• If an itemset is infrequent, then all its supersets are infrequent.
• If an itemset is frequent, then all its subsets are frequent.
Frequent Itemset, Closed Itemset and Maximal Itemset

Due to the Apriori Principle, the frequent itemsets are:
A (3/5), B (4/5), C (4/5), E (4/5), AB (2/5), AC (3/5), AE (2/5), BC (3/5), BE (4/5), CE (3/5), ABC (2/5), ABE (2/5), ACE (2/5), BCE (3/5), ABCE (2/5).
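For the closed and maximal families (standard definitions: a frequent itemset is closed if no proper superset has the same support, and maximal if no frequent proper superset exists), here is a minimal brute-force Python sketch for the five-transaction database above; enumeration is feasible here, since there are only 31 candidate itemsets:

from itertools import combinations

transactions = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"},
                {"B", "E"}, {"A", "B", "C", "E"}]
min_count = 2
items = sorted(set().union(*transactions))

# Support count of every nonempty itemset.
support = {}
for r in range(1, len(items) + 1):
    for combo in combinations(items, r):
        s = frozenset(combo)
        support[s] = sum(s <= t for t in transactions)

frequent = {s for s, c in support.items() if c >= min_count}
# Closed: no proper superset has the same support.
closed = {s for s in frequent
          if not any(s < t and support[t] == support[s] for t in frequent)}
# Maximal: no proper superset is frequent.
maximal = {s for s in frequent if not any(s < t for t in frequent)}

print(len(frequent))                      # 15
print([sorted(s) for s in maximal])       # [['A', 'B', 'C', 'E']]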
Class Exercise
Q1: Find frequent itemsets, closed itemsets and maximal itemsets from the following transaction table.
Illustrate it with step-by-step process.
TID Itemset
100 {I1, I3, I4}
200 {I2, I3, I5}
300 {I1, I2, I3, I5}
400 {I2, I5}
Which Patterns Are Interesting?
=> In the classic games-and-videos example, the two items turn out to be negatively associated, not strongly associated: the rule's confidence is lower than the overall frequency of videos.
• Support: This says how popular an itemset is, as measured by the proportion of transactions in which it appears.
• Confidence: This says how likely item Y is purchased when item X is purchased, expressed as {X -> Y}. It is measured by the proportion of transactions with item X in which item Y also appears.
• Lift: The lift value of an association rule is the ratio of the confidence of the rule to the expected confidence of the rule. This says how likely item Y is purchased when item X is purchased, while controlling for how popular item Y is.
Correlation Analysis
Correlation analysis says how much two items are related. [There are two techniques to find the correlation of objects: Lift and the Chi-square method.]
• In probability theory, two events A and B are independent
if P(A ∪ B) = P(A) × P(B), where (as above) P(A ∪ B) denotes the probability that a transaction contains both A and B.
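Under this convention, Lift can be read directly as an independence test:

Lift(A => B) = P(A ∪ B) / (P(A) × P(B))

Lift = 1 means A and B are independent; Lift > 1 means they are positively correlated; Lift < 1 means they are negatively correlated. In the earlier milk/cheese example, Lift = (6/9) / ((6/9) × (7/9)) = 9/7 ≈ 1.29 > 1, so milk and cheese are positively correlated.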
Improving the efficiency of Apriori:
• As the itemset size increases, the number of transactions to examine can be reduced by DHP (Direct Hashing and Pruning) and by vertical-format mining algorithms.
• Efficient data structures can be used to store the candidates or transactions, so that it is not necessary to match every candidate against every transaction.
We will study only the Apriori algorithm.