Unit - III
Association rule mining is a technique for uncovering hidden relationships among items in large datasets. It identifies frequent patterns (itemsets that occur together in many transactions) and the associations among them, and expresses each association as a rule that shows how frequently an itemset occurs across the transactions. The process of identifying such associations between products/items is what gives the technique its name.
It is a popular method in data mining and machine learning and has a wide range of applications in various
fields, such as market basket analysis, customer segmentation, and fraud detection.
Motivation:
Finding inherent regularities in data
What products were often purchased together?
Milk and Bread?
Apriori Algorithm
The Apriori algorithm is one of the most widely used algorithms for association rule mining.
It works by first identifying the frequent itemsets in the dataset (itemsets that appear in at
least a minimum number of transactions).
It then uses these frequent itemsets to generate association rules, which are statements of the
form "if item A is purchased, then item B is also likely to be purchased."
The Apriori algorithm uses a bottom-up approach, starting with individual items and
gradually building up to more complex itemsets.
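As a sketch of this bottom-up loop, here is a minimal Apriori in Python (an illustrative implementation, not these notes' own code: it counts candidates by plain set scans rather than the hash-tree counting of the original algorithm, and its union-based join produces the same candidates as the prefix join described in the steps below):

from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support_count} for all frequent itemsets.

    transactions: a list of sets of items; min_support: an absolute count.
    A minimal sketch: no candidate-counting optimizations.
    """
    # L1: count every individual item, keep those meeting min_support
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: n for s, n in counts.items() if n >= min_support}
    frequent = dict(current)
    k = 2
    while current:
        # Join step: unite pairs of frequent (k-1)-itemsets that share
        # k-2 items, i.e. whose union has exactly k items
        keys = list(current)
        candidates = {a | b for a in keys for b in keys if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in current
                             for s in combinations(c, k - 1))}
        # Scan the dataset to count support; keep those meeting min_support
        current = {}
        for c in candidates:
            n = sum(1 for t in transactions if c <= t)
            if n >= min_support:
                current[c] = n
        frequent.update(current)
        k += 1
    return frequent

Run on the nine-transaction example worked through below with min_support = 2, this returns the same L1, L2, and L3 that the step-by-step derivation produces.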
The Apriori algorithm has become a standard method for frequent itemset mining and association
rule learning. It has been applied to a variety of tasks, including market basket analysis,
recommendation systems, and fraud detection, and has inspired the development of many other
algorithms for similar problems.
Steps for Apriori Algorithm
Example: Suppose we have the following transaction dataset; from it, we need to find the frequent
itemsets and generate the association rules using the Apriori algorithm:
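The transaction table itself did not survive in these notes, but the support counts used throughout the worked example (e.g. sup(I1) = 6, sup(I2) = 7, sup(I1^I2) = 4) match the classic nine-transaction dataset with min_support = 2, so it is reconstructed here:

TID  Items
T1   I1, I2, I5
T2   I2, I4
T3   I2, I3
T4   I1, I2, I4
T5   I1, I3
T6   I2, I3
T7   I1, I3
T8   I1, I2, I3, I5
T9   I1, I2, I3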
Step-1: K=1
(I) Create a table containing the support count of each item present in the dataset. This table is
called the candidate set C1.
(II) Compare each candidate item's support count with the minimum support count (here
min_support = 2); if an item's support count is less than min_support, remove it.
This gives us the frequent itemset L1.
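A minimal sketch of this counting-and-pruning step in Python, assuming the reconstructed transaction table above (the variable names are illustrative):

from collections import Counter

transactions = [
    {'I1', 'I2', 'I5'}, {'I2', 'I4'}, {'I2', 'I3'},
    {'I1', 'I2', 'I4'}, {'I1', 'I3'}, {'I2', 'I3'},
    {'I1', 'I3'}, {'I1', 'I2', 'I3', 'I5'}, {'I1', 'I2', 'I3'},
]
min_support = 2

# C1: support count of every individual item in the dataset
c1 = Counter(item for t in transactions for item in t)
# L1: keep only items whose support count meets min_support
l1 = {item: n for item, n in c1.items() if n >= min_support}
# Here every item survives: I1:6, I2:7, I3:6, I4:2, I5:2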
Step-2: K=2
(I) Generate candidate set C2 using L1 (this is called the join step). The condition for joining
Lk-1 with Lk-1 is that the itemsets have (K-2) elements in common.
Check whether all subsets of each itemset are frequent, and if not, remove that itemset.
(Example: the subsets of {I1, I2} are {I1} and {I2}, which are frequent. Check this for each
itemset.)
Now find the support count of these itemsets by searching the dataset.
(II) Compare each C2 candidate's support count with the minimum support count (here min_support = 2);
if a candidate's support count is less than min_support, remove it. This gives us the frequent itemset L2.
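Continuing the sketch from the previous snippet, the join and support counting for K = 2 (at this size the subset check is automatic, since every 1-subset of a C2 candidate is already in L1 by construction):

from itertools import combinations

# C2: join L1 with itself, i.e. every 2-item combination of frequent items
c2 = [frozenset(pair) for pair in combinations(sorted(l1), 2)]
# Count each candidate's support with one scan over the dataset
c2_counts = {c: sum(1 for t in transactions if c <= t) for c in c2}
# L2: prune candidates below min_support
l2 = {c: n for c, n in c2_counts.items() if n >= min_support}
# Survivors: {I1,I2}:4, {I1,I3}:4, {I1,I5}:2, {I2,I3}:4, {I2,I4}:2, {I2,I5}:2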
Step-3: K=3
(I) Generate candidate set C3 using L2 (join step). The condition for joining Lk-1 with Lk-1 is that
the itemsets have (K-2) elements in common, so here, for L2, the first element should match.
The itemsets generated by joining L2 are {I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4},
{I2, I4, I5}, and {I2, I3, I5}.
Check whether all subsets of these itemsets are frequent, and if not, remove that itemset. (Here the
subsets of {I1, I2, I3} are {I1, I2}, {I2, I3}, and {I1, I3}, which are frequent. For {I2, I3, I4},
the subset {I3, I4} is not frequent, so remove it. Check every itemset similarly.)
Find the support count of the remaining itemsets by searching the dataset.
(II) Compare each C3 candidate's support count with the minimum support count (here min_support = 2);
if a candidate's support count is less than min_support, remove it. This gives us the frequent
itemset L3.
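The K = 3 round of the same sketch, with the "first element must match" join condition and the subset-pruning test written out (continues the snippets above):

# C3 join: pair up L2 itemsets whose first (K-2 = 1) element matches
pairs = [tuple(sorted(s)) for s in l2]
c3 = {frozenset(a) | frozenset(b)
      for a in pairs for b in pairs
      if a < b and a[0] == b[0]}
# Prune: every 2-item subset of a candidate must itself be in L2
c3 = {c for c in c3 if all(frozenset(s) in l2 for s in combinations(c, 2))}
# Count support of the survivors ({I1,I2,I3} and {I1,I2,I5}) and prune
l3 = {c: sum(1 for t in transactions if c <= t) for c in c3}
l3 = {c: n for c, n in l3.items() if n >= min_support}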
Step-4: K=4
(I) Generate candidate set C4 using L3 (join step). The condition for joining Lk-1 with Lk-1 (K=4)
is that the itemsets have (K-2) elements in common, so here, for L3, the first two items should
match.
Check whether all subsets of these itemsets are frequent. (Here the only itemset formed by joining
L3 is {I1, I2, I3, I5}, and its subset {I1, I3, I5} is not frequent.) So there is no itemset in C4.
We stop here because no further frequent itemsets are found.
Thus, we have discovered all the frequent itemsets. Now the generation of strong association
rules comes into the picture. For that we need to calculate the confidence of each rule.
Confidence:
Confidence(A => B) = Support_count(A ∪ B) / Support_count(A)
For example, a confidence of 60% means that 60% of the customers who purchased milk and
bread also bought butter.
So here, taking one frequent itemset as an example, we will show the rule generation.
Itemset {I1, I2, I3} // from L3
So the rules can be:
[I1^I2] => [I3] // confidence = sup(I1^I2^I3)/sup(I1^I2) = 2/4 * 100 = 50%
[I1^I3] => [I2] // confidence = sup(I1^I2^I3)/sup(I1^I3) = 2/4 * 100 = 50%
[I2^I3] => [I1] // confidence = sup(I1^I2^I3)/sup(I2^I3) = 2/4 * 100 = 50%
[I1] => [I2^I3] // confidence = sup(I1^I2^I3)/sup(I1) = 2/6 * 100 = 33%
[I2] => [I1^I3] // confidence = sup(I1^I2^I3)/sup(I2) = 2/7 * 100 = 28%
[I3] => [I1^I2] // confidence = sup(I1^I2^I3)/sup(I3) = 2/6 * 100 = 33%
So if the minimum confidence is 50%, the first three rules can be considered strong association
rules.
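A sketch of this last step in the same Python setting: enumerate every antecedent => consequent split of a frequent itemset and keep the rules that reach the minimum confidence (rules_from is an illustrative helper, not a standard API; transactions continues from the snippets above):

from itertools import combinations

def rules_from(itemset, transactions, min_confidence=0.5):
    """Yield (antecedent, consequent, confidence) for each strong rule."""
    def support(s):
        # Absolute support count: transactions containing all of s
        return sum(1 for t in transactions if s <= t)
    whole = support(itemset)
    for r in range(1, len(itemset)):
        for antecedent in map(frozenset, combinations(itemset, r)):
            conf = whole / support(antecedent)
            if conf >= min_confidence:
                yield antecedent, itemset - antecedent, conf

# For {I1, I2, I3} this keeps exactly the three 50% rules listed above
for a, c, conf in rules_from(frozenset({'I1', 'I2', 'I3'}), transactions):
    print(sorted(a), '=>', sorted(c), f'{conf:.0%}')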
Association Rule Mining Algorithms