DWDM Unit 3
UNIT-3
Association Analysis
Syllabus: UNIT-III:
Association Analysis: Problem Definition, Frequent Item Set
Generation, The APRIORI Principle, Support and Confidence
Measures, Association Rule Generation, APRIORI Algorithm, The
Partition Algorithms, FP-Growth Algorithm, Compact
Representation of Frequent Item Set-Maximal Frequent Item Set,
Closed Frequent Item Set.
Association Analysis
(Frequent Itemset Mining)
Association Analysis is the task of uncovering relationships among data.
Association analysis is useful for discovering interesting relationships
hidden in large data sets. The uncovered relationships can be represented in
the form of association rules or sets of frequent items.
An association rule is a model that identifies how data items are associated
with each other. For example, it is used in retail sales to identify items that
are frequently purchased together.
Market Basket Analysis
A typical example of frequent itemset mining is market basket analysis.
This process analyzes customer buying habits by finding associations
between the different items that customers place in their “shopping baskets”.
Such valuable information can be used to support a variety of business-
related applications such as marketing promotions, inventory management,
and customer relationship management.
The following table gives an example of such data, commonly known as
market basket transactions. Each row in this table corresponds to a
transaction, which contains a unique identifier labeled TID and the set of
items bought by a given customer.
TID Items
1 {Bread, Milk}
2 {Bread, Diapers, Beer, Eggs}
3 {Milk, Diapers, Beer, Cola}
4 {Bread, Milk, Diapers, Beer}
5 {Bread, Milk, Diapers, Cola}
There are two key issues that need to be addressed when applying
association analysis to market basket data.
First, discovering patterns from a large transaction data set can be
computationally expensive.
Second, some of the discovered patterns may be spurious (happen
simply by chance) and even for non-spurious patterns, some are more
interesting than others.
Terminology used in Association Analysis
Itemset and Support Count: Let I = {i1, i2, ..., id} be the set of all
items in a market basket data set and T = {t1, t2, ..., tN} be the set of all
transactions. Each transaction ti contains a subset of items chosen from I.
In association analysis, a collection of one or more items is termed an
itemset. If an itemset contains k items, it is called a k-itemset.
For instance, {Beer, Diapers, Milk} is an example of a 3-itemset. The null
(or empty) set is an itemset that does not contain any items.
An important property of an itemset is its support count, which refers to
the number of transactions that contain a particular itemset. Mathematically,
the support count, σ(X), for an itemset X can be stated as follows:
σ(X) = |{ti | X ⊆ ti, ti ∈ T}|,
where |·| denotes the number of elements in a set.
Support determines how often a rule is applicable to a given data set, while
confidence determines how frequently items in Y appear in transactions that
contain X.
The formal definitions of these metrics are:
Support, s(X → Y) = σ(X ∪ Y) / N = P(X, Y)
Confidence, c(X → Y) = σ(X ∪ Y) / σ(X) = P(Y | X)
Example: Consider the rule {Milk, Diapers} → {Beer}. Because the support
count for {Milk, Diapers, Beer} is 2 and the total number of transactions is
5, the rule's support is 2/5 = 0.4. The rule's confidence is obtained by
dividing the support count for {Milk, Diapers, Beer} by the support count
for {Milk, Diapers}. Since there are 3 transactions that contain Milk and
Diapers, the confidence for this rule is 2/3 = 0.67.
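To make the computation concrete, here is a minimal Python sketch (not part of the original notes) that evaluates both measures for this rule over the five market basket transactions tabulated above:

```python
# A minimal sketch: support and confidence for {Milk, Diapers} -> {Beer}
# over the five market basket transactions shown earlier.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]

def support_count(itemset, transactions):
    """sigma(X): number of transactions that contain every item of X."""
    return sum(1 for t in transactions if itemset <= t)

X, Y = {"Milk", "Diapers"}, {"Beer"}
N = len(transactions)
s = support_count(X | Y, transactions) / N                               # 2/5 = 0.4
c = support_count(X | Y, transactions) / support_count(X, transactions)  # 2/3
print(round(s, 2), round(c, 2))  # 0.4 0.67
```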
Formulation of the Association Rule Mining Problem: The association
rule mining problem can be formally stated as follows: Given a set of
transactions T, find all the rules having support ≥ minsup and confidence ≥
minconf, where minsup and minconf are the corresponding support and
confidence thresholds.
Assuming that neither the left nor the right hand side of the rule is an empty
set, the total number of possible rules, R, that can be extracted from a data
set containing d items is
R = 3^d - 2^(d+1) + 1.
Example: Suppose a data set contains 6 items. The total number of rules that
can be extracted is R = 3^6 - 2^7 + 1 = 729 - 128 + 1 = 602.
All association rules generated from the same itemset have identical support.
For example, the following rules have identical support because they involve
items from the same itemset, {Beer, Diapers, Milk}:
{Beer, Diapers} → {Milk}, {Beer, Milk} → {Diapers}, {Diapers, Milk} → {Beer},
{Beer} → {Diapers, Milk}, {Diapers} → {Beer, Milk}, {Milk} → {Beer, Diapers}
Apriori Algorithm
Apriori algorithm is a classical algorithm in data mining.
It is used for mining frequent itemsets and relevant association rules.
It is devised to operate on a database containing a large number of
transactions, for instance, items bought by customers in a store.
It was proposed by R. Agrawal and R. Srikant in 1994 for finding frequent
itemsets in a dataset for Boolean association rules.
Name of the algorithm is Apriori because it uses prior knowledge of
frequent itemset properties.
We apply an iterative approach, or level-wise search, where frequent
k-itemsets are used to find (k+1)-itemsets.
To improve the efficiency of the level-wise generation of frequent itemsets,
an important property called the Apriori property is used, which helps
reduce the search space.
The support of an itemset never exceeds the support of its subsets; this is
known as the anti-monotone property of support.
Algorithm:
Let k = 1
Generate frequent itemsets of length 1
Repeat until no new frequent itemsets are identified:
-Generate length-(k+1) candidate itemsets from the length-k frequent
itemsets (to obtain (k+1)-itemsets, merge a pair of frequent k-itemsets
only if their first (k-1) items are identical) --Join step
-Prune candidate itemsets containing subsets of length k that are
infrequent --Prune step
-Count the support of each candidate itemset by scanning the data set
-Eliminate candidates that are infrequent, leaving only those that
are frequent
-Increment k
Join & Prune steps of Apriori Algorithm
Join Step:
To obtain k-itemsets, merge a pair of frequent (k-1)-itemsets only
if their first (k-2) items are identical. The resulting set of
candidate k-itemsets is denoted by Ck.
Prune Step:
This step scans the database to count the support of each candidate.
If a candidate itemset does not meet the minimum support, it is
regarded as infrequent and is removed. The resulting set is the set
of frequent k-itemsets, denoted by Lk.
To reduce the size of Ck, the Apriori property is used: any (k-1)-itemset
that is not frequent cannot be a subset of a frequent k-itemset.
Hence, if any (k-1)-subset of a candidate in Ck is not in Lk-1, the
candidate cannot be frequent either, and it can be removed from Ck
(subset testing).
NOTE:
Ck is a superset of Lk.
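The level-wise procedure above can be sketched compactly in Python. This is an illustrative implementation, not a fixed API; the function name and the returned dictionary format are choices made here:

```python
from itertools import combinations

def apriori(transactions, minsup_count):
    """Level-wise Apriori sketch; returns {itemset (frozenset): support count}."""
    transactions = [set(t) for t in transactions]
    # Frequent 1-itemsets.
    items = {i for t in transactions for i in t}
    counts = {frozenset([i]): sum(1 for t in transactions if i in t) for i in items}
    Lk = {c for c, n in counts.items() if n >= minsup_count}
    frequent = {c: counts[c] for c in Lk}
    k = 2
    while Lk:
        # Join step: merge frequent (k-1)-itemsets whose first k-2 items
        # (in sorted order) are identical, giving candidate k-itemsets Ck.
        prev = sorted(tuple(sorted(s)) for s in Lk)
        Ck = {frozenset(a) | frozenset(b)
              for a, b in combinations(prev, 2) if a[:k - 2] == b[:k - 2]}
        # Prune step: drop candidates having any infrequent (k-1)-subset.
        Ck = {c for c in Ck
              if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
        # One scan of the data counts the support of surviving candidates.
        counts = {c: sum(1 for t in transactions if c <= t) for c in Ck}
        Lk = {c for c, n in counts.items() if n >= minsup_count}
        frequent.update((c, counts[c]) for c in Lk)
        k += 1
    return frequent
```

For instance, apriori([{'A','C','D'}, {'B','C','E'}, {'A','B','C','E'}, {'B','E'}], 2) reproduces the frequent itemsets of Problem 1 below.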
Example 1: For the market basket transactions shown earlier, with a minimum
support count of 3, the frequent itemsets are:
{Beer}, {Bread}, {Diapers}, {Milk}, {Beer,Diapers}, {Bread,Diapers}, {Bread,Milk}, {Diapers,Milk}
Rule Generation
Each frequent k-itemset, Y, can produce up to 2^k - 2 association rules, ignoring
rules that have empty antecedents or consequents (∅ → Y or Y → ∅).
An association rule can be extracted by partitioning the itemset Y into two
non-empty subsets, X and Y - X, such that X → Y - X satisfies the confidence
threshold.
Note that all such rules must already have met the support threshold because
they are generated from a frequent itemset.
Example: Let X = {a, b, c} be a frequent itemset. There are six candidate
association rules that can be generated from X: {a, b} → {c}, {a, c} → {b},
{b, c} → {a}, {a} → {b, c}, {b} → {a, c}, and {c} → {a, b}.
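This partitioning step can be sketched as follows; support_count is assumed to be a callable that returns σ for any itemset (for example, a lookup into the counts produced by an Apriori run), and the function name is illustrative:

```python
from itertools import combinations

def rules_from_itemset(itemset, support_count, minconf):
    """Enumerate the 2^k - 2 candidate rules X -> Y - X from a frequent
    k-itemset Y, keeping those whose confidence reaches minconf."""
    Y = frozenset(itemset)
    kept = []
    for r in range(1, len(Y)):                        # antecedent sizes 1..k-1
        for X in map(frozenset, combinations(sorted(Y), r)):
            conf = support_count(Y) / support_count(X)  # sigma(Y) / sigma(X)
            if conf >= minconf:
                kept.append((set(X), set(Y - X), conf))
    return kept
```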
Problem 1: A database has four transactions. Let minsup =50% and minconf=80%.
Find all frequent itemsets using Apriori and List all the strong association rules.
TID Items
1 {A,C,D}
2 {B,C,E}
3 {A,B,C,E}
4 {B,E}
Solution:
Given that minsup = 50%,
Support count = (50/100) × 4 = 2
Candidate 1-itemsets
Item Count
A 2
B 3
C 3
D 1
E 3
Frequent 1-itemsets
Item Count
A 2
B 3
C 3
E 3
Candidate 2-itemsets
Item Count
AB 1
AC 2
AE 1
BC 2
BE 3
CE 2
Frequent 2-itemsets
Item Count
AC 2
BC 2
BE 3
CE 2
Candidate 3-itemsets
Item Count
BCE 2
Frequent 3-itemsets
Item Count
BCE 2
Rule generation (confidence = σ(X ∪ Y) / σ(X)):
A → C: (2/2) × 100 = 100%
C → A: (2/3) × 100 = 66.67%
B → C: (2/3) × 100 = 66.67%
C → B: (2/3) × 100 = 66.67%
B → E: (3/3) × 100 = 100%
E → B: (3/3) × 100 = 100%
C → E: (2/3) × 100 = 66.67%
E → C: (2/3) × 100 = 66.67%
B → CE: (2/3) × 100 = 66.67%
CE → B: (2/2) × 100 = 100%
BC → E: (2/2) × 100 = 100%
E → BC: (2/3) × 100 = 66.67%
C → BE: (2/3) × 100 = 66.67%
BE → C: (2/3) × 100 = 66.67%
Since minconf = 80%, the strong association rules are:
A → C, B → E, E → B, CE → B, BC → E
Problem 2: Find all frequent itemsets and strong association rules of the following
data set. Support threshold=50%, Confidence= 60%
TABLE-1
Transaction List of items
T1 I1,I2,I3
T2 I2,I3,I4
T3 I4,I5
T4 I1,I2,I4
T5 I1,I2,I3,I5
T6 I1,I2,I3,I4
Solution:
Support threshold=50% => 0.5*6= 3 => minsup=3
1. Count of Each Item
Table-2: Candidate 1-itemsets
Item Count
I1 4
I2 5
I3 4
I4 4
I5 2
2. Prune Step: TABLE-2 shows that item I5 does not meet minsup=3, so it is
deleted; only I1, I2, I3, and I4 meet the minimum support count.
Table-3: Frequent 1-itemsets
Item Count
I1 4
I2 5
I3 4
I4 4
3. Join Step: Form 2-itemsets. From TABLE-1, find the occurrences of each 2-itemset.
TABLE-4: Candidate 2-itemsets
Item Count
I1,I2 4
I1,I3 3
I1,I4 2
I2,I3 4
I2,I4 3
I3,I4 2
4. Prune Step: TABLE-4 shows that itemsets {I1, I4} and {I3, I4} do not
meet minsup, so they are deleted.
TABLE-5: Frequent 2-itemsets
Item Count
I1,I2 4
I1,I3 3
I2,I3 4
I2,I4 3
5. Join and Prune Step: Form 3-itemsets. From TABLE-1, find the occurrences
of each 3-itemset, keeping only candidates whose 2-itemset subsets all meet
min_sup (i.e., appear in TABLE-5).
For the itemset {I1, I2, I3}, the subsets {I1, I2}, {I1, I3}, and {I2, I3} all
occur in TABLE-5, so {I1, I2, I3} is a valid candidate.
For the itemset {I1, I2, I4}, the subset {I1, I4} is not frequent, as it does
not occur in TABLE-5; thus {I1, I2, I4} cannot be frequent, and it is deleted.
TABLE-6: Candidate 3-itemsets (after pruning)
Itemset Count
I1,I2,I3 3
6. Since {I1, I2, I3} meets minsup=3, it is frequent, and no candidate
4-itemsets can be formed, so the algorithm stops. The frequent itemsets are
{I1}, {I2}, {I3}, {I4}, {I1,I2}, {I1,I3}, {I2,I3}, {I2,I4}, and {I1,I2,I3}.
7. Rule generation from {I1, I2, I3} (minconf = 60%):
I1,I2 → I3: 3/4 = 75%; I1,I3 → I2: 3/3 = 100%; I2,I3 → I1: 3/4 = 75%;
I1 → I2,I3: 3/4 = 75%; I2 → I1,I3: 3/5 = 60%; I3 → I1,I2: 3/4 = 75%.
All six rules meet minconf = 60%, so all are strong association rules.
Problem 3: A database has nine transactions. Let min_support count = 2 and
minconf = 50%. Find all frequent itemsets using Apriori and generate the
strong association rules.
TID Items
T100 I1,I2,I5
T200 I2,I4
T300 I2,I3
T400 I1,I2,I4
T500 I1,I3
T600 I2,I3
T700 I1,I3
T800 I1,I2,I3,I5
T900 I1,I2,I3
Solution:
(I) Candidate 1-itemsets
Item Count
I1 6
I2 7
I3 6
I4 2
I5 2
(II) Compare each candidate itemset's support count with the minimum support
count (here min_support=2; if the support count of a candidate is less than
min_support, remove it). All five items meet min_support.
Frequent 1-itemsets
Item Count
I1 6
I2 7
I3 6
I4 2
I5 2
Candidate 2-itemsets:
Itemset Count
I1,I2 4
I1,I3 4
I1,I4 1
I1,I5 2
I2,I3 4
I2,I4 2
I2,I5 2
I3,I4 0
I3,I5 1
I4,I5 0
Frequent 2-itemsets:
Itemset Count
I1,I2 4
I1,I3 4
I1,I5 2
I2,I3 4
I2,I4 2
I2,I5 2
Candidate 3-itemsets (after subset pruning):
Itemset Count
I1,I2,I3 2
I1,I2,I5 2
Frequent 3-itemsets:
Itemset Count
I1,I2,I3 2
I1,I2,I5 2
Check whether all subsets of the candidate 4-itemset are frequent (here the
itemset formed by joining L3 is {I1, I2, I3, I5}, and its subset {I1, I3, I5}
is not frequent). So there is no frequent 4-itemset.
We stop here because no further frequent itemsets are found.
Thus, we have discovered all the frequent itemsets. Now the generation of
strong association rules comes into the picture. For that, we need to
calculate the confidence of each rule.
Confidence:
A confidence of 60% for a rule {milk, bread} → {butter} means that 60% of
the customers who purchased milk and bread also bought butter.
Confidence(A → B) = Support_count(A ∪ B) / Support_count(A)
So here, taking one frequent itemset as an example, we show the rule
generation.
Itemset: {I1, I2, I3} (from L3)
The candidate rules are:
[I1^I2]=>[I3] //confidence = sup(I1^I2^I3)/sup(I1^I2) = 2/4*100=50%
[I1^I3]=>[I2] //confidence = sup(I1^I2^I3)/sup(I1^I3) = 2/4*100=50%
[I2^I3]=>[I1] //confidence = sup(I1^I2^I3)/sup(I2^I3) = 2/4*100=50%
[I1]=>[I2^I3] //confidence = sup(I1^I2^I3)/sup(I1) = 2/6*100=33%
[I2]=>[I1^I3] //confidence = sup(I1^I2^I3)/sup(I2) = 2/7*100=28%
[I3]=>[I1^I2] //confidence = sup(I1^I2^I3)/sup(I3) = 2/6*100=33%
Since the minimum confidence is 50%, the first three rules can be considered
strong association rules.
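For reference, the short self-contained Python sketch below recomputes these six confidences from the support counts of this problem (2/7 = 28.57%, shown as 28% above):

```python
from itertools import combinations

# Support counts taken from this problem's tables.
sup = {
    frozenset({"I1"}): 6, frozenset({"I2"}): 7, frozenset({"I3"}): 6,
    frozenset({"I1", "I2"}): 4, frozenset({"I1", "I3"}): 4,
    frozenset({"I2", "I3"}): 4, frozenset({"I1", "I2", "I3"}): 2,
}
I = frozenset({"I1", "I2", "I3"})
for r in range(1, len(I)):
    for A in map(frozenset, combinations(sorted(I), r)):
        conf = sup[I] / sup[A]          # sigma(I) / sigma(antecedent)
        print(sorted(A), "->", sorted(I - A), f"{conf * 100:.2f}%")
```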
Problem 4: Consider the following dataset and find frequent itemsets and
generate strong association rules for them. Minimum support count is 2 and
minimum confidence is 60%
Iteration 1: Given that minsup is 2, create the itemsets of size 1 and
calculate their support values.
Item {4} has a support value of 1, which is less than the minimum support
value, so we discard {4} in the upcoming iterations. This gives the final
table F1.
Iteration 2: Next we create itemsets of size 2 and calculate their support
values. All combinations of the items in F1 are used in this iteration.
Itemsets having support less than 2 are eliminated again, in this case {1,2}.
Now let's understand what pruning is and how it makes Apriori one of the best
algorithms for finding frequent itemsets.
Pruning: We divide the itemsets in C3 into subsets and eliminate those
itemsets that have a subset with a support value less than 2.
Iteration 3: We discard {1,2,3} and {1,2,5} because they both contain {1,2},
which is infrequent. This subset-based pruning is the main highlight of the
Apriori algorithm.
Iteration 4: Joining the itemsets of F3 would give {1,2,3,5}. Since the
support of this itemset is less than 2, we stop here; the final frequent
itemsets are those in F3.
Note: up to this point we have not calculated any confidence values.
With F3 we get the following itemsets:
For I = {1,3,5}, subsets are {1,3}, {1,5}, {3,5}, {1}, {3}, {5}
For I = {2,3,5}, subsets are {2,3}, {2,5}, {3,5}, {2}, {3}, {5}
Applying Rules: We now create rules and apply them to the itemsets of F3,
given that the minimum confidence value is 60%.
For every non-empty proper subset S of I, output the rule S → (I − S).
For I = {1,3,5}:
Rule 1: {1,3} → ({1,3,5} − {1,3}), i.e., {1,3} → {5}
Confidence = support(1,3,5)/support(1,3) = 2/3 = 66.66% > 60%
Hence Rule 1 is selected.
Evaluating the remaining subsets the same way, Rule 2 is selected and Rule 6
is rejected.
This is how rules are created in the Apriori algorithm, and the same steps
can be applied to the itemset {2,3,5}.
FP-Growth Algorithm
(Algorithm for finding frequent itemsets without candidate generation)
The FP-Growth algorithm is an alternative way to find frequent itemsets
without using candidate generation, thus improving performance. To do so,
it uses a divide-and-conquer strategy.
FP-Growth is a very fast and memory efficient algorithm.
The algorithm encodes the data set using a compact data structure called
an FP-tree and extracts frequent itemsets directly from this structure.
In simple words, this algorithm works as follows:
First it compresses the input database creating an FP-tree instance
to represent frequent items.
After this first step it divides the compressed database into a set of
conditional databases, each one associated with one frequent
pattern.
Finally, each such database is mined separately.
Using this strategy, FP-Growth reduces search costs by recursively
looking for short patterns and then concatenating them into longer
frequent patterns, offering good selectivity.
FP-Tree Representation
An FP-tree is a compressed representation of the input data.
It is constructed by reading the data set one transaction at a time and
mapping each transaction onto a path in the FP-tree.
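The compression step can be sketched in Python as follows. This is a minimal illustration under simple assumptions (two passes over the data, items reordered by descending support with alphabetical tie-breaking); the header table with node-links and the recursive mining of conditional FP-trees are omitted:

```python
class FPNode:
    """One node of an FP-tree: an item, a count, and child links."""
    def __init__(self, item, parent=None):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}                 # item -> FPNode

def build_fp_tree(transactions, minsup_count):
    """Pass 1: count item supports. Pass 2: insert each transaction's
    frequent items, ordered by descending support, as a path from the
    root; overlapping prefixes share nodes, whose counts are incremented."""
    counts = {}
    for t in transactions:
        for item in t:
            counts[item] = counts.get(item, 0) + 1
    freq = {i: n for i, n in counts.items() if n >= minsup_count}
    root = FPNode(None)
    for t in transactions:
        ordered = sorted((i for i in t if i in freq),
                         key=lambda i: (-freq[i], i))  # ties broken alphabetically
        node = root
        for item in ordered:
            node = node.children.setdefault(item, FPNode(item, node))
            node.count += 1
    return root, freq
```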
Problem 1: Find all frequent item sets using FP-growth algorithm. Let
minsup=2 and minconf=70%
FP-Tree:
Problem 2: A database has five transactions. Let minsup = 60% and
minconf = 80%. Find all frequent itemsets using FP-Growth.
Solution:
Minsup = 60% = (60/100) × 5 = 3
Candidate 1-itemsets
Problem 3: Find all frequent itemsets using the FP-Growth algorithm. Let minsup = 3.
FP-Tree:
Compact Representation of Frequent Itemsets
Maximal Frequent Itemsets
A frequent itemset is maximal if none of its immediate supersets is frequent.
In the example itemset lattice (Figure 5.16), the itemsets {a, d}, {a, c, e},
and {b, c, d, e} are maximal frequent itemsets because all of their immediate
supersets are infrequent. For example, the itemset {a, d} is maximal frequent
because all of its immediate supersets, {a, b, d}, {a, c, d}, and {a, d, e},
are infrequent. In contrast, {a, c} is non-maximal because one of its
immediate supersets, {a, c, e}, is frequent.
Maximal frequent itemsets effectively provide a compact representation
of frequent itemsets. In other words, they form the smallest set of itemsets
from which all frequent itemsets can be derived. For example, every
frequent itemset in Figure 5.16 is a subset of one of the three maximal
frequent itemsets, {a, d}, {a, c, e}, and {b, c, d, e}. If an itemset is not a
proper subset of any of the maximal frequent itemsets, then it is either
infrequent (e.g., {a, d, e}) or maximal frequent itself (e.g., {b, c, d, e}).
Hence, the maximal frequent itemsets {a, c, e}, {a, d}, and {b, c, d, e}
provide a compact representation of the frequent itemsets shown in above
lattice.
Closed Itemsets
Closed itemsets provide a minimal representation of all itemsets without
losing their support information.
An itemset X is closed if none of its immediate supersets has exactly the
same support count as X. Put another way, X is not closed if at least one
of its immediate supersets has the same support count as X.
An interesting property of closed itemsets is that if we know their support
counts, we can derive the support count of every other itemset in the
itemset lattice without making additional passes over the data set.
Closed Frequent Itemset
An itemset is a closed frequent itemset if it is closed and its support is
greater than or equal to minsup.
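Given the complete set of frequent itemsets together with their support counts, both definitions can be checked mechanically. Below is a minimal Python sketch (illustrative; it assumes a dict mapping each frequent itemset, as a frozenset, to its support count). Restricting the closure check to frequent immediate supersets is safe, because an infrequent superset has a count below minsup and so can never match the support of a frequent itemset:

```python
def maximal_and_closed(frequent):
    """frequent: dict {frozenset: support count} of ALL frequent itemsets.
    Returns (maximal, closed) frequent itemsets per the definitions above."""
    maximal, closed = [], []
    for X, sup in frequent.items():
        # Frequent immediate supersets of X (one extra item).
        immediate = [Y for Y in frequent if X < Y and len(Y) == len(X) + 1]
        if not immediate:                    # no frequent immediate superset
            maximal.append(X)
        if all(frequent[Y] != sup for Y in immediate):
            closed.append(X)
    return maximal, closed
```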
Problem: Find all closed frequent itemsets and maximal frequent itemsets of
the following data set. Let minsup = 3.
1: A,B,C,E
2: A,C,D,E
3: B,C,E
4: A,C,D,E
5: C,D,E
6: A,D,E
Solution:
Candidate 1-itemsets
Item Count
A 4
B 2
C 5
D 4
E 6
Frequent 1-itemsets
Item Count
A 4
C 5
D 4
E 6
Candidate 2-itemsets
itemset Count
AC 3
AD 3
AE 4
CD 3
CE 5
DE 4
Frequent 2-itemsets
itemset Count
AC 3
AD 3
AE 4
CD 3
CE 5
DE 4
Candidate 3-itemsets
itemset Count
ACD 2
ACE 3
ADE 3
CDE 3
Frequent 3-itemsets
itemset Count
ACE 3
ADE 3
CDE 3
Analysis:
{E} (count 6) is closed because no immediate superset has the same count.
{A}, {C}, and {D} are not closed, since {A,E}, {C,E}, and {D,E} have the same
counts (4, 5, and 4 respectively). For example, {D} = 4 is not closed due to
{D,E}; it is also not maximal because a superset such as {A,D} is frequent.
Among the frequent 2-itemsets, {A,E}, {C,E}, and {D,E} are closed; {A,C},
{A,D}, and {C,D} are not, because {A,C,E}, {A,D,E}, and {C,D,E} have the same
counts.
All frequent 3-itemsets {A,C,E}, {A,D,E}, and {C,D,E} are closed, and they
are also maximal because every immediate superset (e.g., {A,C,D,E}, with
count 2) is infrequent.
Closed frequent itemsets: {E}, {A,E}, {C,E}, {D,E}, {A,C,E}, {A,D,E}, {C,D,E}
Maximal frequent itemsets: {A,C,E}, {A,D,E}, {C,D,E}
Problem: Find all frequent itemsets, maximal frequent itemsets, and closed
frequent itemsets of the following data set. Let minsup = 1.
TID Items
1 {A,B,C,D}
2 {A,B}
Solution:
Given minsup=1
Frequent Itemsets:
A,B,C,D,AB,AC,AD,BC,BD,CD,ABC,ABD,BCD,ACD,ABCD
Maximal Frequent Itemsets: ABCD
Closed Frequent Itemsets: AB, ABCD
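This answer can be verified with a brute-force Python sketch (feasible here because there are only four items):

```python
from itertools import combinations

transactions = [{"A", "B", "C", "D"}, {"A", "B"}]
items = sorted(set().union(*transactions))

# With minsup = 1, every non-empty itemset occurring in some transaction is frequent.
frequent = {}
for r in range(1, len(items) + 1):
    for c in combinations(items, r):
        n = sum(1 for t in transactions if set(c) <= t)
        if n >= 1:
            frequent[frozenset(c)] = n

maximal = [X for X in frequent if not any(X < Y for Y in frequent)]
closed = [X for X in frequent
          if not any(X < Y and len(Y) == len(X) + 1 and frequent[Y] == frequent[X]
                     for Y in frequent)]
print(sorted(map(sorted, maximal)))  # [['A', 'B', 'C', 'D']]
print(sorted(map(sorted, closed)))   # [['A', 'B'], ['A', 'B', 'C', 'D']]
```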
Tutorial Questions
1. Discuss Apriori algorithm with suitable example.
(or) Write an algorithm for finding frequent item sets using candidate
generation.
3. Briefly describe the relation among frequent, maximal frequent, and closed
frequent itemsets.
5. A database has four transactions. Let min-sup =60% and min-conf = 80%. Find all
frequent itemsets using Apriori and List all the strong association rules.
10. Consider the transactions occurring in the given order: {{a,b}, {b,c,d},
{a,b,c}, {a,b,c,d}, {a,b,c}}. Draw the FP-tree after each transaction.