

UNIT-3
Association Analysis
Syllabus: UNIT -III:
Association Analysis: Problem Definition, Frequent Item Set
Generation, The APRIORI Principle, Support and Confidence
Measures, Association Rule Generation, APRIORI Algorithm, The
Partition Algorithms, FP-Growth Algorithm, Compact
Representation of Frequent Item Set-Maximal Frequent Item Set,
Closed Frequent Item Set.
Association Analysis
(Frequent Itemset Mining)
 Association Analysis is the task of uncovering relationships among data.
 Association analysis is useful for discovering interesting relationships
hidden in large data sets. The uncovered relationships can be represented in
the form of association rules or sets of frequent items.
 Association rule is a model that identifies how the data items are associated
with each other. Ex: It is used in retail sales to identify items that are
frequently purchased together.
Market Basket Analysis
 A typical example of frequent itemset mining is market basket analysis.
 This process analyzes customer buying habits by finding associations
between the different items that customers place in their “shopping baskets”.
 Such valuable information can be used to support a variety of business-
related applications such as marketing promotions, inventory management,
and customer relationship management.
 Following table gives an example of such data, commonly known as
Market Basket Transactions. Each row in this table corresponds to a
transaction, which contains a unique identifier labeled TID and a set of items
bought by a given customer.
TID   Items
1     {Bread, Milk}
2     {Bread, Diapers, Beer, Eggs}
3     {Milk, Diapers, Beer, Cola}
4     {Bread, Milk, Diapers, Beer}
5     {Bread, Milk, Diapers, Cola}

 There are two key issues that need to be addressed when applying
association analysis to market basket data.
 First, discovering patterns from a large transaction data set can be
computationally expensive.
 Second, some of the discovered patterns may be spurious (happening
simply by chance), and even among the non-spurious patterns, some are
more interesting than others.
Terminology used in Association Analysis
 Itemset and Support Count: Let I = {i1, i2, ..., id} be the set of all
items in a market basket data and T = {t1, t2, ..., tN} be the set of all
transactions. Each transaction ti contains a subset of items chosen from I.
 In association analysis, a collection of one or more items is termed an
itemset. If an itemset contains k items, it is called a k-itemset.
 For instance, {Beer, Diapers, Milk} is an example of a 3-itemset. The null
(or empty) set is an itemset that does not contain any items.
 An important property of an itemset is its support count, which refers to
the number of transactions that contain a particular itemset. Mathematically,
the support count, σ(X), for an itemset X can be stated as follows:
σ(X) = |{ti | X ⊆ ti, ti ∈ T}|
where the symbol |·| denotes the number of elements in a set.


 In the data set shown in the market basket table above, the support count
for {Beer, Diapers, Milk} is equal to two because there are only two
transactions (TIDs 3 and 4) that contain all three items.

 Often, the property of interest is the support, which is the fraction of
transactions in which an itemset occurs: s(X) = σ(X)/N.
Support count: Number of transactions in which an itemset occurs
Support: Fraction of transactions in which an itemset occurs
 An itemset X is called frequent if s(X) is greater than or equal to a
user-defined threshold, minsup.
Itemset: A collection of one or more items
k-itemset: An itemset that contains k items
Frequent itemset: An itemset that appears frequently in a dataset
 Association Rule: An association rule is an implication expression of the
form X → Y, where X and Y are disjoint itemsets, i.e., X ∩ Y = ∅.
 The strength of an association rule can be measured in terms of its support
and confidence.

 Support determines how often a rule is applicable to a given data set, while
confidence determines how frequently items in Y appear in transactions that
contain X.
 The formal definitions of these metrics are
Support, s(X → Y) = σ(X ∪ Y) / N = P(X, Y)
Confidence, c(X → Y) = σ(X ∪ Y) / σ(X) = P(Y | X)

 Example: Consider the rule {Milk, Diapers} → {Beer}. Because the support
count for {Milk, Diapers, Beer} is 2 and the total number of transactions is
5, the rule’s support is 2/5=0.4. The rule’s confidence is obtained by
dividing the support count for {Milk, Diapers, Beer} by the support count
for {Milk, Diapers}. Since there are 3 transactions that contain milk and
diapers, the confidence for this rule is 2/3=0.67.
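These quantities are straightforward to compute directly. The following is a
minimal Python sketch; the transaction list mirrors the market basket table
above, and the helper name support_count is illustrative, not from any library:

# Support and confidence of {Milk, Diapers} -> {Beer} over the five
# market basket transactions shown above.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]

def support_count(itemset, transactions):
    """sigma(X): the number of transactions that contain every item of X."""
    return sum(1 for t in transactions if itemset <= t)

X, Y = {"Milk", "Diapers"}, {"Beer"}
N = len(transactions)
s = support_count(X | Y, transactions) / N                                # 2/5 = 0.4
c = support_count(X | Y, transactions) / support_count(X, transactions)  # 2/3
print(f"support = {s:.2f}, confidence = {c:.2f}")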
 Formulation of the Association Rule Mining Problem: The association
rule mining problem can be formally stated as follows: Given a set of
transactions T, find all the rules having support ≥ minsup and confidence ≥
minconf, where minsup and minconf are the corresponding support and
confidence thresholds.
 Assuming that neither the left nor the right hand side of the rule is an
empty set, the total number of possible rules, R, that can be extracted from a
data set containing d items is
R = 3^d − 2^(d+1) + 1
 Example: Suppose a data set contains 6 items. The total number of rules that
can be extracted is R = 3^6 − 2^7 + 1 = 602.
 All the association rules generated from the same itemset have identical
support. For example, the following six rules have identical support because
they involve items from the same itemset, {Beer, Diapers, Milk}:
{Beer, Diapers} → {Milk}, {Beer, Milk} → {Diapers}, {Diapers, Milk} → {Beer},
{Beer} → {Diapers, Milk}, {Diapers} → {Beer, Milk}, {Milk} → {Beer, Diapers}

 A data set that contains d items can potentially generate up to 2^d − 1
frequent itemsets, excluding the null set.
 A data set that contains d items can potentially yield up to 3^d − 2^(d+1) + 1
association rules, excluding null rules.
 Each k-itemset can produce up to 2^k − 2 association rules, excluding null
rules.
 All the association rules generated from the same itemset have identical
support.
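These counts are easy to sanity-check by brute force for small d. A short
Python verification, assuming d = 6 as in the example above:

from math import comb

# Each k-itemset yields 2^k - 2 rules; summing over all itemset sizes
# reproduces the closed form 3^d - 2^(d+1) + 1.
d = 6
total_rules = sum(comb(d, k) * (2**k - 2) for k in range(2, d + 1))
print(total_rules, 3**d - 2**(d + 1) + 1)   # prints: 602 602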

 A common strategy adopted by many association rule mining algorithms is
to decompose the problem into two major subtasks:
1. Frequent Itemset Generation, whose objective is to find all the
itemsets that satisfy the minsup threshold
2. Rule Generation, whose objective is to extract all the high confidence
rules from the frequent itemsets found in the previous step. These rules are
called strong rules.
 Transaction width is the number of items present in a transaction.

Efficient and Scalable Frequent Itemset Mining Methods


1. The Apriori Algorithm for finding Frequent Itemsets Using Candidate
Generation
2. The FP-Growth Algorithm for finding Frequent Itemsets without
Candidate Generation.

Apriori Algorithm
 Apriori algorithm is a classical algorithm in data mining.
 It is used for mining frequent itemsets and relevant association rules.
 It is devised to operate on a database containing a large number of
transactions, for instance, items bought by customers in a store.
 It was proposed by R. Agrawal and R. Srikant in 1994 for finding frequent
itemsets in a dataset for Boolean association rules.
 The name of the algorithm is Apriori because it uses prior knowledge of
frequent itemset properties.
 We apply an iterative approach or level-wise search where frequent
k-itemsets are used to find (k+1)-itemsets.
 To improve the efficiency of level-wise generation of frequent itemsets, an
important property is used called Apriori property which helps by reducing
the search space.


Apriori Property (Apriori Principle):


 All non-empty subsets of a frequent itemset must also be frequent,
i.e., if an itemset is infrequent, all its supersets will be infrequent.
 The Apriori principle holds due to the following property of the support
measure:
∀ X, Y : (X ⊆ Y) ⟹ s(X) ≥ s(Y)
 The support of an itemset never exceeds the support of its subsets; this is
known as the anti-monotone property of support.
Algorithm:
Let k=1
Generate frequent itemsets of length 1
Repeat until no new frequent itemsets are identified
-Generate length (k+1) candidate itemsets from length k frequent
itemsets(to obtain k-itemsets, merge a pair of frequent (k-1)-
itemsets only if their first (k-2) items are identical) --Join step
-Prune candidate itemsets containing subsets of length k that are
infrequent --Prune step
-Count the support of each candidate itemset by scanning the data set
-Eliminate candidates that are infrequent, leaving only those that
are frequent.
Join & Prune steps of Apriori Algorithm
Join Step:
 To obtain candidate k-itemsets, merge a pair of frequent (k-1)-itemsets only
if their first (k-2) items are identical. This set of candidate k-itemsets
is denoted by Ck.
Prune Step:
 This step scans the support count of each candidate in the database. If a
candidate itemset does not meet minimum support, it is regarded
as infrequent and removed. The resulting set is the set of frequent
k-itemsets, denoted by Lk.
 To reduce the size of Ck, the Apriori property is used: any (k-1)-itemset
that is not frequent cannot be a subset of a frequent k-itemset.
Hence, if any (k-1)-subset of a candidate in Ck is not in Lk-1, the
candidate cannot be frequent either and can be removed from Ck (subset
testing).
NOTE:
Ck is a superset of Lk
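The join and prune steps above translate almost line for line into code. Below
is a minimal Python sketch of Apriori, not an optimized implementation; the
representation chosen here (transactions as sets, itemsets as frozensets) is an
assumption made for illustration:

from itertools import combinations

def apriori(transactions, minsup_count):
    """Return a dict mapping each frequent itemset (frozenset) to its support count."""
    tsets = [frozenset(t) for t in transactions]

    # Frequent 1-itemsets
    counts = {}
    for t in tsets:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Lk = {s: c for s, c in counts.items() if c >= minsup_count}
    frequent, k = dict(Lk), 1

    while Lk:
        # Join step: merge frequent k-itemsets whose first k-1 items agree
        prev = sorted(sorted(s) for s in Lk)
        Ck = {frozenset(prev[i]) | frozenset(prev[j])
              for i in range(len(prev)) for j in range(i + 1, len(prev))
              if prev[i][:k - 1] == prev[j][:k - 1]}
        # Prune step: drop candidates that have an infrequent k-subset
        Ck = {c for c in Ck
              if all(frozenset(sub) in Lk for sub in combinations(c, k))}
        # Count the support of the surviving candidates in one pass over the data
        counts = {c: 0 for c in Ck}
        for t in tsets:
            for c in Ck:
                if c <= t:
                    counts[c] += 1
        Lk = {s: c for s, c in counts.items() if c >= minsup_count}
        frequent.update(Lk)
        k += 1
    return frequent

For example, apriori([{'A','C','D'}, {'B','C','E'}, {'A','B','C','E'},
{'B','E'}], 2) reproduces the frequent itemsets of Problem 1 below.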


Example 1: Consider the market basket transactions from the table above, with a
minimum support count of 3.
Frequent Itemsets:
{Beer},{Bread},{Diapers},{Milk},{Beer,Diapers},{Bread,Diapers},{Bread,Milk},{Diapers,Milk}
Example 2: Consider the four transactions of Problem 1 below ({A,C,D}, {B,C,E},
{A,B,C,E}, {B,E}), with a minimum support count of 2.
Frequent Itemsets: {A},{B},{C},{E},{A,C},{B,C},{B,E},{C,E},{B,C,E}


Example 3:
Transactions table:
TID    Items
T100   I1,I2,I5
T200   I2,I4
T300   I2,I3
T400   I1,I2,I4
T500   I1,I3
T600   I2,I3
T700   I1,I3
T800   I1,I2,I3,I5
T900   I1,I2,I3

Minsup=2

Frequent Itemsets are: {I1},{I2},{I3},{I4},{I5},{I1,I2},{I1,I3},{I1,I5},
{I2,I3},{I2,I4},{I2,I5},{I1,I2,I3},{I1,I2,I5}


Rule Generation
 Each frequent k-itemset, Y, can produce up to 2^k − 2 association rules, ignoring
rules that have empty antecedents or consequents (∅ → Y or Y → ∅).
 An association rule can be extracted by partitioning the itemset Y into two
non-empty subsets, X and Y − X, such that X → Y − X satisfies the confidence
threshold.
 Note that all such rules must have already met the support threshold because
they are generated from a frequent itemset.
 Example: Let X = {a, b, c} be a frequent itemset. There are six candidate
association rules that can be generated from X: {a, b} → {c}, {a, c} → {b},
{b, c} → {a}, {a} → {b, c}, {b} → {a, c}, and {c} → {a, b}.
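This partitioning is a few lines of Python. A sketch, assuming supports is the
dict returned by the apriori sketch earlier (by the Apriori principle, every
non-empty subset of a frequent itemset is guaranteed to be present in it):

from itertools import combinations

def rules_from_itemset(itemset, supports, minconf):
    """Generate every rule X -> Y-X from one frequent itemset that meets minconf."""
    Y = frozenset(itemset)
    rules = []
    for r in range(1, len(Y)):                 # antecedent sizes 1 .. k-1
        for X in map(frozenset, combinations(Y, r)):
            conf = supports[Y] / supports[X]   # c(X -> Y-X) = sigma(Y) / sigma(X)
            if conf >= minconf:
                rules.append((set(X), set(Y - X), conf))
    return rules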
Problem 1: A database has four transactions. Let minsup =50% and minconf=80%.
Find all frequent itemsets using Apriori and List all the strong association rules.

TID Items
1 {A,C,D}
2 {B,C,E}
3 {A,B,C,E}
4 {B,E}
Solution:
Given that minsup =50%
Support count = (50/100) × 4 = 2
Candidate 1-itemsets
Item Count
A 2
B 3
C 3
D 1
E 3
Frequent 1-itemsets
Item Count
A 2
B 3
C 3
E 3


Candidate 2-itemsets
Item   Count
AB     1
AC     2
AE     1
BC     2
BE     3
CE     2

Frequent 2-itemsets
Item   Count
AC     2
BC     2
BE     3
CE     2

Candidate 3-itemsets
Item   Count
ABC    1
ACE    1
BCE    2

Frequent 3-itemsets
Item   Count
BCE    2

Frequent itemsets are: {A},{B},{C},{E},{A,C},{B,C},{B,E},{C,E},{B,C,E}


Association Rules:
Rule       Confidence
A → C      2/2 × 100 = 100%
C → A      2/3 × 100 = 66.67%
B → C      2/3 × 100 = 66.67%
C → B      2/3 × 100 = 66.67%
B → E      3/3 × 100 = 100%
E → B      3/3 × 100 = 100%
C → E      2/3 × 100 = 66.67%
E → C      2/3 × 100 = 66.67%
B → CE     2/3 × 100 = 66.67%
CE → B     2/2 × 100 = 100%
BC → E     2/2 × 100 = 100%
E → BC     2/3 × 100 = 66.67%
C → BE     2/3 × 100 = 66.67%
BE → C     2/3 × 100 = 66.67%
(Note: σ(BE) = 3, since B and E occur together in transactions 2, 3, and 4.)

Since minconf = 80%, the strong association rules are:
A → C, B → E, E → B, CE → B, BC → E

Problem 2: Find all frequent itemsets and strong association rules of the following
data set. Support threshold=50%, Confidence= 60%
TABLE-1
Transaction   List of items
T1            I1,I2,I3
T2            I2,I3,I4
T3            I4,I5
T4            I1,I2,I4
T5            I1,I2,I3,I5
T6            I1,I2,I3,I4
Solution:
Support threshold=50% => 0.5*6= 3 => minsup=3
1. Count of Each Item
Table-2: Candidate 1-itemsets
Item Count
I1 4
I2 5
I3 4
I4 4
I5 2


2. Prune Step: TABLE-2 shows that item I5 does not meet minsup=3, so it
is deleted; only I1, I2, I3, I4 meet the minsup count.
Table-3: Frequent 1-itemsets
Item Count
I1 4
I2 5
I3 4
I4 4
3. Join Step: Form 2-itemsets. From TABLE-1, find out the occurrences of each 2-itemset.
TABLE-4: Candidate 2-itemsets
Item Count
I1,I2 4
I1,I3 3
I1,I4 2
I2,I3 4
I2,I4 3
I3,I4 2
4. Prune Step: TABLE-4 shows that the itemsets {I1, I4} and {I3, I4} do not
meet minsup, so they are deleted.
TABLE-5: Frequent 2-itemsets
Item Count
I1,I2 4
I1,I3 3
I2,I3 4
I2,I4 3
5. Join and Prune Step: Form 3-itemsets. From TABLE-1, find out the
occurrences of each 3-itemset, and from TABLE-5 check that all of its
2-itemset subsets are frequent.
For the itemset {I1, I2, I3}, the subsets {I1, I2}, {I1, I3}, and {I2, I3} all
occur in TABLE-5, thus {I1, I2, I3} is frequent.
For the itemset {I1, I2, I4}, the subset {I1, I4} is not frequent, as it does
not occur in TABLE-5; thus {I1, I2, I4} is not frequent, and it is deleted.
TABLE-6: Candidate 3-itemsets
Item
I1,I2,I3
I1,I2,I4
I1,I3,I4
I2,I3,I4


Only {I1, I2, I3} is frequent.


TABLE-7: Frequent 3-itemsets
Item
I1,I2,I3
6. Generate Association Rules: From the frequent itemset discovered above
the association could be:
{I1, I2} => {I3}
Confidence = support {I1, I2, I3} / support {I1, I2} = (3/ 4)* 100 = 75%
{I1, I3} => {I2}
Confidence = support {I1, I2, I3} / support {I1, I3} = (3/ 3)* 100 = 100%
{I2, I3} => {I1}
Confidence = support {I1, I2, I3} / support {I2, I3} = (3/ 4)* 100 = 75%
{I1} => {I2, I3}
Confidence = support {I1, I2, I3} / support {I1} = (3/ 4)* 100 = 75%
{I2} => {I1, I3}
Confidence = support {I1, I2, I3} / support {I2} = (3/ 5)* 100 = 60%
{I3} => {I1, I2}
Confidence = support {I1, I2, I3} / support {I3} = (3/ 4)* 100 = 75%
This shows that all the above association rules are strong if minimum
confidence threshold is 60%.
Problem 3: Consider the following dataset and find frequent itemsets and
generate strong association rules for them. Minimum support count is 2 and
minimum confidence is 60%

TID    Items
T100   I1,I2,I5
T200   I2,I4
T300   I2,I3
T400   I1,I2,I4
T500   I1,I3
T600   I2,I3
T700   I1,I3
T800   I1,I2,I3,I5
T900   I1,I2,I3

(I) Create a table containing the support count of each item present in the dataset.

Candidate 1-itemsets
Item   Count
I1     6
I2     7
I3     6
I4     2
I5     2
(II) compare candidate set item’s support count with minimum support
count(here min_support=2 if support_count of candidate set items is less than
min_support then remove those items).
Frequent 1-itemsets
Item   Count
I1     6
I2     7
I3     6
I4     2
I5     2
Candidate 2-itemsets:
Itemset   Count
I1,I2     4
I1,I3     4
I1,I4     1
I1,I5     2
I2,I3     4
I2,I4     2
I2,I5     2
I3,I4     0
I3,I5     1
I4,I5     0

Frequent 2-itemsets:
Itemset   Count
I1,I2     4
I1,I3     4
I1,I5     2
I2,I3     4
I2,I4     2
I2,I5     2
Candidate 3-itemsets:
Itemset      Count
I1,I2,I3     2
I1,I2,I5     2
(the other joins, such as {I1,I3,I5} and {I2,I3,I4}, are pruned because they
contain an infrequent 2-itemset subset)
Frequent 3-itemsets:
Itemset      Count
I1,I2,I3     2
I1,I2,I5     2
 Check whether all subsets of these itemsets are frequent (here the itemset
formed by joining L3 is {I1, I2, I3, I5}, and its subset {I1, I3, I5} is not
frequent). So there is no frequent 4-itemset.
 We stop here because no further frequent itemsets can be found.

Thus, we have discovered all the frequent itemsets. Now the generation of
strong association rules comes into the picture. For that we need to calculate
the confidence of each rule.
Confidence:
A confidence of 60% means that 60% of the customers who purchased milk
and bread also bought butter.
Confidence(A->B)=Support_count(A∪B)/Support_count(A)
So here, by taking an example of any frequent itemset, we will show the rule
generation.
Itemset {I1, I2, I3} //from L3
So the rules can be
[I1^I2]=>[I3] //confidence = sup(I1^I2^I3)/sup(I1^I2) = 2/4*100=50%
[I1^I3]=>[I2] //confidence = sup(I1^I2^I3)/sup(I1^I3) = 2/4*100=50%
[I2^I3]=>[I1] //confidence = sup(I1^I2^I3)/sup(I2^I3) = 2/4*100=50%
[I1]=>[I2^I3] //confidence = sup(I1^I2^I3)/sup(I1) = 2/6*100=33%
[I2]=>[I1^I3] //confidence = sup(I1^I2^I3)/sup(I2) = 2/7*100=28%
[I3]=>[I1^I2] //confidence = sup(I1^I2^I3)/sup(I3) = 2/6*100=33%
So if the minimum confidence were 50%, the first 3 rules could be considered
strong association rules; with the given threshold of 60%, none of the rules
generated from {I1, I2, I3} qualifies.


Problem 4: Consider the following dataset and find frequent itemsets and
generate strong association rules for them. Minimum support count is 2 and
minimum confidence is 60%

Iteration 1: Given that minsup is 2, create the itemsets of size 1 and
calculate their support values.

As you can see here, item 4 has a support value of 1 which is less than the min
support value. So we are going to discard {4} in the upcoming iterations. We have
the final Table F1.

Iteration 2: Next we will create itemsets of size 2 and calculate their support
values. All combinations of the items in F1 are used in this iteration.

Itemsets having support less than 2 are eliminated again, in this case {1,2}. Now,
let's understand what pruning is and how it makes Apriori one of the best
algorithms for finding frequent itemsets.

Pruning: We are going to divide the itemsets in C3 into subsets and eliminate
any itemset that has a subset with a support value less than 2.

Iteration 3: We will discard {1,2,3} and {1,2,5} as they both contain {1,2}. This is the
main highlight of the Apriori Algorithm.

Iteration 4: Using sets of F3 we will create C4.

Since the Support of this itemset is less than 2, we will stop here and the final
itemset we will have is F3.
Note: up to this point we have not calculated any confidence values.
With F3 we get the following itemsets:
For I = {1,3,5}, subsets are {1,3}, {1,5}, {3,5}, {1}, {3}, {5}
For I = {2,3,5}, subsets are {2,3}, {2,5}, {3,5}, {2}, {3}, {5}

Applying Rules: We will create rules and apply them to the itemsets in F3,
given that the minimum confidence value is 60%.
For every subset S of I, output the rule

 S –> (I-S) (meaning S recommends I-S)

 if support(I) / support(S) >= minconf

{1,3,5}
Rule 1: {1,3} –> ({1,3,5} – {1,3}) means 1 & 3 –> 5
Confidence = support(1,3,5)/support(1,3) = 2/3 = 66.66% > 60%
Hence Rule 1 is Selected

Rule 2: {1,5} –> ({1,3,5} – {1,5}) means 1 & 5 –> 3


Confidence = support(1,3,5)/support(1,5) = 2/2 = 100% > 60%


Rule 2 is Selected

Rule 3: {3,5} –> ({1,3,5} – {3,5}) means 3 & 5 –> 1


Confidence = support(1,3,5)/support(3,5) = 2/3 = 66.66% > 60%
Rule 3 is Selected

Rule 4: {1} –> ({1,3,5} – {1}) means 1 –> 3 & 5


Confidence = support(1,3,5)/support(1) = 2/3 = 66.66% > 60%
Rule 4 is Selected

Rule 5: {3} –> ({1,3,5} – {3}) means 3 –> 1 & 5


Confidence = support(1,3,5)/support(3) = 2/4 = 50% <60%
Rule 5 is Rejected

Rule 6: {5} –> ({1,3,5} – {5}) means 5 –> 1 & 3


Confidence = support(1,3,5)/support(5) = 2/4 = 50% < 60%

Rule 6 is Rejected

This is how rules are created in the Apriori Algorithm; the same steps can be
applied to the itemset {2,3,5}.

FP-Growth Algorithm
(Algorithm for finding frequent itemsets without candidate generation)
 The FP-Growth Algorithm is an alternative way to find frequent itemsets
without using candidate generation, thus improving performance. To do
so, it uses a divide-and-conquer strategy.
 FP-Growth is a very fast and memory efficient algorithm.
 The algorithm encodes the data set using a compact data structure called
an FP-tree and extracts frequent itemsets directly from this structure.
 In simple words, this algorithm works as follows:
 First it compresses the input database creating an FP-tree instance
to represent frequent items.
 After this first step it divides the compressed database into a set of
conditional databases, each one associated with one frequent
pattern.
 Finally, each such database is mined separately.
 Using this strategy, FP-Growth reduces search costs by recursively
looking for short patterns and then concatenating them into longer
frequent patterns, offering good selectivity.
FP-Tree Representation
 An FP-tree is a compressed representation of the input data.
 It is constructed by reading the data set one transaction at a time and
mapping each transaction onto a path in the FP-tree.


 As different transactions can have several items in common, their paths
might overlap. The more the paths overlap with one another, the more
compression we can achieve using the FP-tree structure.
 Following figure shows a data set that contains ten transactions and five
items. The structures of the FP-tree after reading the first three
transactions are also depicted in the diagram.
 Each node in the tree contains the label of an item along with a counter
that shows the number of transactions mapped onto the given path.
 Initially, the FP-tree contains only the root node represented by the null
symbol.
 The FP-tree is subsequently extended in the following way:
1. The data set is scanned once to determine the support count of each
item. Infrequent items are discarded, while the frequent items are
sorted in decreasing support counts inside every transaction of the
data set. For the data set shown in Figure 5.24, a is the most frequent
item, followed by b, c, d, and e.
2. The algorithm makes a second pass over the data to construct the
FP-tree. After reading the first transaction, {a, b}, the nodes labeled as
a and b are created. A path is then formed from null → a → b to
encode the transaction. Every node along the path has a frequency
count of 1.


3. After reading the second transaction, {b,c,d}, a new set of nodes is
created for items b, c, and d. A path is then formed to represent the
transaction by connecting the nodes null → b → c → d. Every node
along this path also has a frequency count equal to one. Although
the first two transactions have an item in common, which is b, their
paths are disjoint because the transactions do not share a common
prefix.
4. The third transaction, {a,c,d,e}, shares a common prefix item (which
is a) with the first transaction. As a result, the path for the third
transaction, null → a → c → d → e, overlaps with the path for the
first transaction, null → a → b. Because of their overlapping path, the
frequency count for node a is incremented to two, while the frequency
counts for the newly created nodes, c, d, and e, are equal to one.
5. This process continues until every transaction has been mapped onto
one of the paths given in the FP-tree. The resulting FP-tree after
reading all the transactions is shown in Figure 5.24 above.
 The size of an FP-tree is typically smaller than the size of the
uncompressed data because many transactions in market basket data often
share a few items in common.
 The size of an FP-tree also depends on how the items are ordered.
 An FP-tree also contains a list of pointers connecting nodes that have the
same items. These pointers, represented as dashed lines in Figures 5.24
and 5.25, help to facilitate the rapid access of individual items in the tree.
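The two-pass construction just described can be sketched compactly in Python.
The class and function names below are illustrative, and the header table plays
the role of the dashed node-links:

from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}                      # item -> FPNode

def build_fptree(transactions, minsup_count):
    # Pass 1: determine the support count of each item.
    support = defaultdict(int)
    for t in transactions:
        for item in t:
            support[item] += 1
    # Pass 2: insert each transaction, with infrequent items dropped and the
    # remaining items sorted in decreasing support order.
    root = FPNode(None, None)                   # the null root
    header = defaultdict(list)                  # item -> its nodes (node-links)
    for t in transactions:
        items = sorted((i for i in t if support[i] >= minsup_count),
                       key=lambda i: (-support[i], i))
        node = root
        for item in items:
            if item not in node.children:
                node.children[item] = FPNode(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1                     # overlapping prefixes share nodes
    return root, header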
Frequent Itemset Generation in FP-Growth Algorithm
 The FP-tree is mined as follows. Start from each frequent length-1 pattern
(as an initial suffix pattern), construct its conditional pattern base (a
“sub-database,” which consists of the set of prefix paths in the FP-tree co-
occurring with the suffix pattern), then construct its (conditional) FP-tree,
and perform mining recursively on that tree.
 The pattern growth is achieved by the concatenation of the suffix pattern
with the frequent patterns generated from a conditional FP-tree
 The algorithm looks for frequent itemsets ending in e first, followed by d,
c, b, and finally, a.
 We first consider e, which occurs in three FP-tree branches of Figure 5.24.
The paths formed by these branches are <a,c,d,e:1>, <a,d,e:1> and <b,c,e:1>.
Therefore, considering e as a suffix, its corresponding three prefix paths
are <a,c,d:1>, <a,d:1> and <b,c:1>, which form its conditional pattern
base. Using this conditional pattern base as a transaction database, we
build an e-conditional FP-tree, which contains two paths, <a:2,d:2>
and <c:2>; c is not included in the first path because its support count of 1
is less than the minimum support count. The two paths generate all the
combinations of frequent patterns: ae, de, ade and ce.


 This mining process is summarized in the following table:

Item   Conditional Pattern Base                     Conditional FP-tree               Frequent Patterns Generated
e      {{a,c,d:1},{a,d:1},{b,c:1}}                  <a:2,d:2>, <c:2>                  ae, de, ade, ce
d      {{a,b,c:1},{a,b:1},{a,c:1},{a:1},{b,c:1}}    <a:2,b:2>, <a:2,c:2>, <b:3,c:3>   ad, bd, abd, cd, acd, bcd
c      {{a,b:3},{a:1},{b:2}}                        <a:4,b:3>, <b:2>                  ac, bc, abc
b      {{a:5}}                                      <a:5>                             ab
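Given the header table from the FP-tree sketch above, extracting an item's
conditional pattern base is just a walk from each of its nodes up to the root.
A small sketch:

def conditional_pattern_base(header, item):
    """Return (prefix_path, count) pairs for the given suffix item."""
    paths = []
    for node in header[item]:                   # follow the node-links
        path, parent = [], node.parent
        while parent is not None and parent.item is not None:
            path.append(parent.item)            # climb toward the root
            parent = parent.parent
        if path:
            paths.append((list(reversed(path)), node.count))
    return paths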

Problem 1: Find all frequent item sets using FP-growth algorithm. Let
minsup=2 and minconf=70%

FP-Tree:


Problem 2: A database has five transactions. Let min sup = 60% and min conf =
80%. Find all frequent itemsets using FP-Growth.

Solution:
Minsup = 60% = (60/100) × 5 = 3

Candidate 1-itemsets


Problem 3: Find all frequent itemsets using the FP-Growth algorithm. Let minsup=3


FP-Tree:

Item   Conditional Pattern Base    Conditional FP-tree   Frequent Patterns Generated
p      {{f,c,a,m:2},{c,b:1}}       <c:3>                 cp
m      {{f,c,a:2},{f,c,a,b:1}}     <f:3, c:3, a:3>       fm, cm, am, acm, afm, cfm, acfm
b      {{f,c,a:1},{f:1}}           -                     -
a      {{f,c:3}}                   <f:3, c:3>            af, ac, acf
c      {{f:3}}                     <f:3>                 cf
Therefore the frequent itemsets are:
a, b, c, f, m, p, cp, fm, cm, am, af, ac, cf, acm, afm, cfm, acf, acfm
Apriori versus FP-Growth
 Apriori generates candidate itemsets (join and prune steps); FP-Growth finds
frequent itemsets without candidate generation.
 Apriori scans the database once per level of its level-wise search; FP-Growth
scans it only twice (once to count item supports, once to build the FP-tree).
 Apriori uses a breadth-first, level-wise search; FP-Growth uses a
divide-and-conquer strategy over conditional FP-trees.
 Apriori can become slow when many candidates are generated; FP-Growth is very
fast and memory efficient because it works on the compact FP-tree.

Compact Representation of Frequent Itemsets


(Maximal & Closed Frequent itemset)
 The number of frequent itemsets produced from a transaction data set can
be very large.
 It is useful to identify a small representative set of frequent itemsets from
which all other frequent itemsets can be derived.
 Two such representations are :
 Maximal Frequent Itemsets
 Closed Frequent Itemsets
Maximal Frequent Itemsets
 A frequent itemset is maximal if none of its immediate supersets are frequent.

 In the above lattice, itemsets {a, d}, {a, c, e}, and {b, c, d, e} are maximal
frequent itemsets because all of their immediate supersets are infrequent.
For example, the itemset {a, d} is maximal frequent because all of its
immediate supersets, {a, b, d}, {a, c, d}, and {a, d, e}, are infrequent. In
contrast, {a, c} is non-maximal because one of its immediate supersets,
{a, c, e}, is frequent.
 Maximal frequent itemsets effectively provide a compact representation
of frequent itemsets. In other words, they form the smallest set of itemsets
from which all frequent itemsets can be derived. For example, every
frequent itemset in Figure 5.16 is a subset of one of the three maximal
frequent itemsets, {a, d}, {a, c, e}, and {b, c, d, e}. If an itemset is not a
proper subset of any of the maximal frequent itemsets, then it is either
infrequent (e.g., {a, d, e}) or maximal frequent itself (e.g., {b, c, d, e}).


Hence, the maximal frequent itemsets {a, c, e}, {a, d}, and {b, c, d, e}
provide a compact representation of the frequent itemsets shown in above
lattice.
Closed Itemsets
 Closed itemsets provide a minimal representation of all itemsets without
losing their support information.
 An itemset X is closed if none of its immediate supersets has exactly the
same support count as X. Put another way, X is not closed if at least one
of its immediate supersets has the same support count as X.
 An interesting property of closed itemsets is that if we know their support
counts, we can derive the support count of every other itemset in the
itemset lattice without making additional passes over the data set.
Closed Frequent Itemset
 An itemset is a closed frequent itemset if it is closed and its support is
greater than or equal to minsup.

 The advantage of using closed frequent itemsets is that it is often sufficient
to present only the closed frequent itemsets to the analysts, instead of the
entire set of frequent itemsets.
 All maximal frequent itemsets are closed because none of the maximal
frequent itemsets can have the same support count as their immediate
supersets.
 Maximal frequent itemsets are a subset of the set of closed frequent itemsets,
which in turn are a subset of all frequent itemsets.
 A maximal frequent itemset is an itemset that has no immediate
superset that is frequent.
 A closed frequent itemset is an itemset that has no immediate
superset that has the same support.
 All maximal frequent itemsets are closed but not vice versa
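Both definitions can be checked mechanically once the supports of all frequent
itemsets are known. A small sketch, assuming supports maps each frequent
itemset (frozenset) to its support count, as produced by the apriori sketch
earlier:

def closed_and_maximal(supports):
    """Split the frequent itemsets into closed and maximal ones."""
    closed, maximal = [], []
    for X, sup in supports.items():
        # Frequent immediate supersets of X (one extra item). An infrequent
        # superset cannot tie X's support, so checking frequent ones suffices.
        supersets = [Y for Y in supports if len(Y) == len(X) + 1 and X < Y]
        if all(supports[Y] != sup for Y in supersets):
            closed.append(X)     # no immediate superset has the same support
        if not supersets:
            maximal.append(X)    # no immediate superset is frequent at all
    return closed, maximal

Since a maximal frequent itemset has no frequent immediate superset at all, it
is vacuously closed, which matches the containment stated above.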


 The relationships among frequent, closed, closed frequent, and maximal
frequent itemsets are shown in the figure below.

Problem: Find all closed frequent itemsets and maximal frequent itemsets of
the following data set .Let minsup=3
1: A,B,C,E
2: A,C,D,E
3: B,C,E
4: A,C,D,E
5: C,D,E
6: A,D,E
Solution:
Candidate 1-itemsets
Item Count
A 4
B 2
C 5
D 4
E 6
Frequent 1-itemsets
Item Count
A 4
C 5
D 4
E 6


Candidate 2-itemsets
itemset Count
AC 3
AD 3
AE 4
CD 3
CE 5
DE 4

Frequent 2-itemsets
itemset Count
AC 3
AD 3
AE 4
CD 3
CE 5
DE 4

Candidate 3-itemsets
itemset Count
ACD 2
ACE 3
ADE 3
CDE 3

Frequent 3-itemsets
itemset Count
ACE 3
ADE 3
CDE 3


{A} = 4 ; not closed due to {A,E}

{C} = 5 ; not closed due to {C,E}

{D} = 4 ; not closed due to {D,E}, and not maximal due to e.g. {A,D}

{E} = 6 ; closed, but not maximal due to e.g. {D,E}

{A,C} = 3; not closed due to {A,C,E}

{A,D} = 3; not closed due to {A,D,E}

{A,E} = 4; closed, but not maximal due to {A,D,E}

{C,D} = 3; not closed due to {C,D,E}

{C,E} = 5; closed, but not maximal due to {C,D,E}

{D,E} = 4; closed, but not maximal due to {A,D,E}

{A,C,E} = 3; closed and maximal frequent

{A,D,E} = 3; closed and maximal frequent

{C,D,E} = 3; closed and maximal frequent

Therefore Closed frequent itemsets are:


{E},{A,E},{C,E},{D,E},{A,C,E},{A,D,E},{C,D,E}
Maximal frequent itemsets are: {A,C,E},{A,D,E},{C,D,E}

Problem: Find all frequent itemsets, maximal frequent itemsets and closed
frequent itemsets of the following data set. Let minsup=1
TID Items

1 {A,B,C,D}

2 {A,B}

Solution:
Given minsup=1
Frequent Itemsets:
A,B,C,D,AB,AC,AD,BC,BD,CD,ABC,ABD,BCD,ACD,ABCD
Maximal Frequent Itemsets: ABCD
Closed Frequent Itemsets: AB, ABCD


Tutorial Questions
1. Discuss Apriori algorithm with suitable example.
(or) Write an algorithm for finding frequent item sets using candidate
generation.

2. Write the algorithm to discover frequent item sets without candidate


generation and explain it with an example.
(or) Explain FP-growth algorithm for generating the frequent patterns with
an example. Write its advantages over other mining algorithms.

3. Briefly describe the relation among frequent, maximal frequent, and closed
frequent item sets.

4. How to represent FP-tree? Explain with an example.


(OR) Briefly describe the process of constructing an FP-tree with an example.

5. A database has four transactions. Let min-sup =60% and min-conf = 80%. Find all
frequent itemsets using Apriori and List all the strong association rules.

9. A database has five transactions. Let min-sup =60% and min-conf =


80%.Find all frequent itemsets using FP-growth and List all the strong
association rules.

10. Consider the transactions occurring the in the given order: {{a,b}, {b,c,d},
{a,b,c}, {a,b,c,d}, {a,b,c}}. Draw FP tree after each transaction.

