Unit 5 Mining Frequent Patterns and Cluster Analysis
Support :
• Support indicates the usefulness of a rule, i.e., how frequently the rule applies in the given data.
• The support of itemset X in database D is the number of transactions in D that
contain it: sup(X, D) = |{t ∈ D : t contains X}|
• Support_count(X) : the number of transactions in which X appears. If X = A ∪ B,
then it is the number of transactions in which both A and B are present.
• Support(A => B) = Support_count(A ∪ B) / total number of transactions
• A support of 5% means that 5% of all transactions in the database follow the rule.
Confidence:
• The confidence (or strength) of an association rule A => B is the ratio of the
number of transactions that contain A ∪ B to the number of transactions that
contain A:
Confidence(A => B) = (tuples containing both A and B) / (tuples containing A)
• A confidence of 60% means that 60% of the customers who purchased milk and
bread also bought butter.
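The following Python sketch shows how these formulas can be computed. The function names and the small transaction list are my own, invented purely for illustration.

# Support and confidence over a toy transaction database.
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "eggs"},
    {"milk", "bread", "eggs"},
]

def support_count(itemset, transactions):
    # Number of transactions that contain every item of `itemset`.
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    # Fraction of transactions containing `itemset`.
    return support_count(itemset, transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    # Confidence(A => B) = support_count(A U B) / support_count(A).
    return (support_count(antecedent | consequent, transactions)
            / support_count(antecedent, transactions))

print(support({"milk", "bread"}, transactions))                 # 0.8
print(confidence({"milk", "bread"}, {"butter"}, transactions))  # 0.5

Here 50% of the customers who bought milk and bread also bought butter, matching the confidence definition above.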
Frequent itemset :
• An itemset X is frequent if X’s support is not less than a minimum support threshold (min_sup).
• Equivalently, a frequent itemset is a set of items that appears in at least a pre-specified minimum number of transactions.
• Frequent itemsets are typically used to generate Association rules.
Closed itemset :
An itemset is closed if none of its immediate supersets has the same support as the
itemset.
Consider two itemsets X and Y: if every item of X is in Y and there is at least one item of Y
that is not in X, then Y is a proper super-itemset of X.
If no proper super-itemset of X has the same support count as X, then X is a closed itemset.
If X is both closed and frequent, it is called a closed frequent itemset.
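A minimal sketch of this closed-itemset test, reusing support_count from the sketch above; the function names are my own.

def is_closed(itemset, transactions, all_items):
    # X is closed if no immediate superset (X plus one extra item)
    # has the same support count as X.
    base = support_count(itemset, transactions)
    return all(support_count(itemset | {item}, transactions) != base
               for item in all_items - itemset)

def is_closed_frequent(itemset, transactions, all_items, min_sup_count):
    # Closed frequent itemset: closed, and support meets min_sup.
    return (support_count(itemset, transactions) >= min_sup_count
            and is_closed(itemset, transactions, all_items))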
Association Rule :
Rules that satisfy both a minimum support threshold (min_sup) and a minimum
confidence threshold (min_conf) are called strong association rules.
Frequent Itemset Mining Methods :
Apriori Algorithm (worked example) :
Step-3:
(I)
• Generate candidate set C3 using L2 (join step). The condition for joining Lk-1 and Lk-1
(K=3) is that they should have (K-2) elements in common. So here, for L2, the first element
should match.
• So the itemsets generated by joining L2 are
{I1, I2, I3} {I1, I2, I5} {I1, I3, I5} {I2, I3, I4} {I2, I4, I5} {I2, I3, I5}
• Check whether all subsets of these itemsets are frequent, and if not, remove
that itemset. (Here the subsets of {I1, I2, I3} are {I1, I2}, {I2, I3} and {I1, I3}, which
are all frequent. For {I2, I3, I4}, the subset {I3, I4} is not frequent, so remove it.
Check every itemset in the same way.)
• Find the support count of the remaining itemsets by searching the dataset.
(A code sketch of this join-and-prune step follows step (II) below.)
(II)
o Compare the candidate (C3) support counts with the minimum support count (here min_support = 2).
o If the support count of a candidate itemset is less than min_support, remove that itemset.
o This gives us itemset L3.
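Below is a hedged Python sketch of this candidate-generation step (the Apriori join and prune, followed by the support filter). Itemsets are represented as sorted tuples; the function names are my own.

from itertools import combinations

def apriori_gen(prev_frequent, k):
    # Join step: combine two frequent (k-1)-itemsets that share their
    # first k-2 items. Prune step: drop any candidate that has an
    # infrequent (k-1)-subset.
    prev = set(prev_frequent)
    candidates = set()
    for a in prev:
        for b in prev:
            if a[:k - 2] == b[:k - 2] and a[k - 2] < b[k - 2]:
                cand = a + (b[k - 2],)
                if all(sub in prev for sub in combinations(cand, k - 1)):
                    candidates.add(cand)
    return candidates

def filter_by_support(candidates, transactions, min_sup_count):
    # Scan the dataset and keep candidates meeting the minimum support.
    return {c for c in candidates
            if sum(1 for t in transactions if set(c) <= t) >= min_sup_count}

Calling apriori_gen with the frequent 2-itemsets L2 and k = 3 reproduces the join and prune described above; filter_by_support then yields L3.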
Step-4:
(I)
o Generate candidate set C4 using L3 (join step). The condition for joining Lk-1 and
Lk-1 (K=4) is that they should have (K-2) elements in common. So here, for L3, the
first 2 elements (items) should match.
o Check whether all subsets of these itemsets are frequent (here the only itemset
formed by joining L3 is {I1, I2, I3, I5}, and its subset {I1, I3, I5} is not
frequent). So there is no itemset in C4.
o We stop here because no further frequent itemsets can be found.
Generation of strong association rule :
To generate strong rules, we need to calculate the confidence of each candidate rule.
Confidence –
Confidence(A => B) = (tuples containing both A and B) / (tuples containing A)
Taking one frequent itemset as an example, we show the rule generation below.
Frequent Itemset {I1, I2, I3} //from L3
So the rules can be :
• [I1^I2]=>[I3] // confidence = sup(I1^I2^I3)/sup(I1^I2) = 2/4 = 50%
• [I1^I3]=>[I2] // confidence = sup(I1^I2^I3)/sup(I1^I3) = 2/4 = 50%
• [I2^I3]=>[I1] // confidence = sup(I1^I2^I3)/sup(I2^I3) = 2/4 = 50%
• [I1]=>[I2^I3] // confidence = sup(I1^I2^I3)/sup(I1) = 2/6 ≈ 33%
• [I2]=>[I1^I3] // confidence = sup(I1^I2^I3)/sup(I2) = 2/7 ≈ 29%
• [I3]=>[I1^I2] // confidence = sup(I1^I2^I3)/sup(I3) = 2/6 ≈ 33%
So if the minimum confidence threshold is 60%, then no rule can be considered a strong
association rule.
Frequent Itemset {I1, I2, I5} //from L3
So the rules can be :
• [I1^I2]=>[I5] // confidence = sup(I1^I2^I5)/sup(I1^I2) = 2/4 = 50%
• [I1^I5]=>[I2] // confidence = sup(I1^I2^I5)/sup(I1^I5) = 2/2 = 100%
• [I2^I5]=>[I1] // confidence = sup(I1^I2^I5)/sup(I2^I5) = 2/2 = 100%
• [I1]=>[I2^I5] // confidence = sup(I1^I2^I5)/sup(I1) = 2/6 ≈ 33%
• [I2]=>[I1^I5] // confidence = sup(I1^I2^I5)/sup(I2) = 2/7 ≈ 29%
• [I5]=>[I1^I2] // confidence = sup(I1^I2^I5)/sup(I5) = 2/2 = 100%
So if the minimum confidence threshold is 60%, then the following rules can be considered
strong association rules:
[I1^I5]=>[I2] confidence = 100%
[I2^I5]=>[I1] confidence = 100%
[I5]=>[I1^I2] confidence = 100%
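A small Python sketch that reproduces this rule generation: for a frequent itemset, every non-empty proper subset A is tried as an antecedent of the rule A => (itemset − A). The support counts below are copied from the worked example; the function names are my own.

from itertools import combinations

sup = {
    frozenset({"I1", "I2", "I5"}): 2,
    frozenset({"I1", "I2"}): 4,
    frozenset({"I1", "I5"}): 2,
    frozenset({"I2", "I5"}): 2,
    frozenset({"I1"}): 6,
    frozenset({"I2"}): 7,
    frozenset({"I5"}): 2,
}

def strong_rules(itemset, sup, min_conf):
    # Enumerate antecedents of every size and keep rules whose
    # confidence = sup(itemset) / sup(antecedent) meets min_conf.
    itemset = frozenset(itemset)
    rules = []
    for r in range(1, len(itemset)):
        for antecedent in map(frozenset, combinations(itemset, r)):
            conf = sup[itemset] / sup[antecedent]
            if conf >= min_conf:
                rules.append((set(antecedent), set(itemset - antecedent), conf))
    return rules

for a, b, conf in strong_rules({"I1", "I2", "I5"}, sup, 0.60):
    print(a, "=>", b, f"{conf:.0%}")
# Prints exactly the three 100%-confidence rules listed above.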
Limitations of the Apriori Algorithm :
• Apriori needs to generate candidate itemsets. These candidate itemsets may
be very numerous if the number of items in the database is large.
• Apriori needs multiple scans of the database to check the support of each
candidate itemset generated, and this leads to high costs.
Improving the Efficiency of Apriori :
Many variations of the Apriori algorithm have been proposed that focus on
improving the efficiency of the original algorithm.
Some variations are as follows:
• Hash-based technique (hashing itemsets into corresponding buckets; a sketch follows this list)
• Transaction reduction (reducing the number of transactions scanned in
future iterations)
• Partitioning (partitioning the data to find candidate itemsets)
• Sampling (mining on a subset of the given data)
• Dynamic itemset counting (adding candidate itemsets at different
points during a scan)
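As an illustration of the first variation, here is a hedged sketch of hash-based counting for 2-itemsets (in the spirit of the DHP technique): during a scan, every 2-itemset of every transaction is hashed into a bucket, and a candidate pair whose bucket total falls below the minimum support cannot be frequent. The bucket count of 8 is an arbitrary choice for illustration.

from itertools import combinations

NUM_BUCKETS = 8

def bucket_counts(transactions):
    # Hash every 2-itemset of every transaction into a bucket.
    buckets = [0] * NUM_BUCKETS
    for t in transactions:
        for pair in combinations(sorted(t), 2):
            buckets[hash(pair) % NUM_BUCKETS] += 1
    return buckets

def may_be_frequent(pair, buckets, min_sup_count):
    # A pair whose bucket total is below min_sup cannot be frequent;
    # the converse does not hold, since collisions can inflate a bucket.
    return buckets[hash(tuple(sorted(pair))) % NUM_BUCKETS] >= min_sup_count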
A Pattern-Growth Approach for Mining Frequent Itemsets :
The limitations of the Apriori pattern-mining method can be overcome using the
FP-growth algorithm.
1. The first scan of the database yields the support count of each item:

Item    Support_Count
I1      4
I2      5
I3      4
I4      4
I5      2

Item I5 does not meet the minimum support count, so it is dropped from the
transactions, and the items in each transaction are reordered by descending
support count: I2, I1, I3, I4. The FP tree starts with a single root node
labelled Null.
2. Transaction T1: I2, I1, I3 contains three items, inserted as {I2:1}, {I1:1}, {I3:1},
where I2 is linked as a child of the root, I1 is linked to I2, and I3 is linked to I1.
Null
 └─ I2:1
     └─ I1:1
         └─ I3:1
3. T2: I2, I3, I4 contains I2, I3, and I4, where I2 is linked to the root, I3 is linked to
I2 and I4 is linked to I3. This branch shares the I2 node, as it is already used in T1.
4. Increment the count of I2 by 1; I3 is linked as a child to I2 and I4 is linked as
a child to I3. The counts are {I2:2}, {I3:1}, {I4:1}.
Null
 └─ I2:2
     ├─ I1:1
     │   └─ I3:1
     └─ I3:1
         └─ I4:1
5. T3: I4. A new branch with I4 linked as a child of the root (Null) is created.
Null
 ├─ I2:2
 │   ├─ I1:1
 │   │   └─ I3:1
 │   └─ I3:1
 │       └─ I4:1
 └─ I4:1
6. T4 : I2, I1, I4. The sequence will be I2, I1, and I4. I2 is already linked to the
root node, hence its count is incremented by 1. Similarly the count of I1 is
incremented by 1, as it is already linked with I2 (from T1); I4 is added as a new
child of I1. Thus {I2:3}, {I1:2}, {I4:1}.
Null
 ├─ I2:3
 │   ├─ I1:2
 │   │   ├─ I3:1
 │   │   └─ I4:1
 │   └─ I3:1
 │       └─ I4:1
 └─ I4:1
7. T5 : I2, I1, I3. The sequence will be I2, I1, I3. Thus {I2:4}, {I1:3}, {I3:2}.
Null
 ├─ I2:4
 │   ├─ I1:3
 │   │   ├─ I3:2
 │   │   └─ I4:1
 │   └─ I3:1
 │       └─ I4:1
 └─ I4:1
8. T6 : I2, I1, I3, I4. The sequence will be I2, I1, I3, and I4; I4 is added as a new child of I3. Thus {I2:5}, {I1:4}, {I3:3}, {I4:1}.
Null
 ├─ I2:5
 │   ├─ I1:4
 │   │   ├─ I3:3
 │   │   │   └─ I4:1
 │   │   └─ I4:1
 │   └─ I3:1
 │       └─ I4:1
 └─ I4:1
As a check, if for every item the sum of its counts across all of its nodes equals that item's support count, then the FP tree is correct.
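The tree construction traced above can be sketched in Python as follows. The class and field names are my own, and the transactions are assumed to be pre-sorted by descending item support (I2, I1, I3, I4), as in the example.

class FPNode:
    def __init__(self, item, parent):
        self.item = item        # item label; None for the Null root
        self.count = 1
        self.parent = parent
        self.children = {}      # item -> FPNode

def build_fp_tree(sorted_transactions):
    root = FPNode(None, None)   # the Null root
    header = {}                 # item -> list of its nodes (node links)
    for t in sorted_transactions:
        node = root
        for item in t:
            if item in node.children:
                node.children[item].count += 1   # shared prefix: bump count
            else:
                child = FPNode(item, node)       # start a new branch
                node.children[item] = child
                header.setdefault(item, []).append(child)
            node = node.children[item]
    return root, header

# The six sorted transactions from the example:
T = [["I2", "I1", "I3"], ["I2", "I3", "I4"], ["I4"],
     ["I2", "I1", "I4"], ["I2", "I1", "I3"], ["I2", "I1", "I3", "I4"]]
root, header = build_fp_tree(T)

The correctness check above can then be written as sum(n.count for n in header["I3"]) == 4, i.e. the counts of all I3 nodes add up to I3's support count.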
Mining frequent patterns from FP tree :
• Use the FP tree and recursively grow frequent pattern paths, taking the items in
ascending order of support count.
• For each item, the conditional pattern base (the set of prefix paths leading to that
item, with their counts) is constructed, and from it the conditional FP tree is
constructed (see the sketch below).
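Continuing the sketch above, the conditional pattern base of an item can be read off the tree by following each of the item's nodes up to the root and recording the prefix path together with that node's count:

def conditional_pattern_base(item, header):
    # For every node holding `item`, collect the prefix path from the
    # root down to (but excluding) the node, weighted by the node's count.
    base = []
    for node in header.get(item, []):
        path = []
        parent = node.parent
        while parent is not None and parent.item is not None:
            path.append(parent.item)
            parent = parent.parent
        if path:
            base.append((list(reversed(path)), node.count))
    return base

print(conditional_pattern_base("I4", header))
# [(['I2', 'I3'], 1), (['I2', 'I1'], 1), (['I2', 'I1', 'I3'], 1)]

The conditional FP tree for the item is then built from these weighted prefix paths in the same way as the original tree.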