Unit 2
Association Rules: Problem Definition, Frequent Item Set Generation, The APRIORI Principle,
Support and Confidence Measures, Association Rule Generation; APRIORI Algorithm, The
Partition Algorithms, FP-Growth Algorithms, Compact Representation of Frequent Item Sets:
Maximal Frequent Item Set, Closed Frequent Item Set.
Association Analysis:
Many business enterprises accumulate large quantities of data from their day-to-day
operations. For example, huge amounts of customer purchase data are collected daily at the
checkout counters of grocery stores. The table below illustrates an example of such data, commonly
known as market basket transactions. Each row in this table corresponds to a transaction, which
contains a unique identifier labeled TID and the set of items bought by a given customer. Retailers
are interested in analyzing the data to learn about the purchasing behavior of their customers.
Such valuable information can be used to support a variety of business-related applications such
as marketing promotions, inventory management, and customer relationship management.
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
Association analysis is useful for discovering interesting relationships hidden in large
data sets. The uncovered relationships can be represented in the form of association rules or sets
of frequent items. For example, the following rule can be extracted from the data set shown in
the table:
{Diaper} → {Beer}
The rule suggests that a strong relationship exists between the sale of diapers and beer, because
many customers who buy diapers also buy beer. Retailers can use this type of rule to help them
identify new opportunities for cross-selling products to their customers. Besides market
basket data, association analysis is also applicable to other application domains such as
bioinformatics, medical diagnosis, Web mining, and scientific data analysis. In the analysis of
Earth science data, for example, the association patterns may reveal interesting connections
among ocean, land, and atmospheric processes.
There are two key issues that need to be addressed when applying association analysis to market
basket data. First, discovering patterns from a large transaction data set can be computationally
expensive. Second, some of the discovered patterns are potentially spurious because they may
happen simply by chance. The first part of the chapter explains the basic concepts of association
analysis and the algorithms used to efficiently mine such patterns. The second part of the chapter
deals with the issue of evaluating the discovered patterns in order to prevent the generation of
spurious results.
2.1 Problem Definition: The basic terminology used in association analysis is as follows.
Binary Representation: Market basket data can be represented in a binary format, where each row
corresponds to a transaction and each column corresponds to an item; an item's value is one if it
is present in the transaction and zero otherwise.
Support: The fraction of transactions that contain an itemset, e.g. s({Milk, Bread, Diaper}) = 2/5.
Frequent Itemset: An itemset whose support is greater than or equal to a minsup threshold.
Association Rule: An association rule is an implication expression of the form X → Y, where X
and Y are disjoint itemsets, i.e., X ∩ Y = ∅. The strength of an association rule can be measured
in terms of its support and confidence. Support determines how often a rule is applicable to a
given data set, while confidence determines how frequently items in Y appear in transactions that
contain X. The formal definitions of these metrics are
Support: s(X → Y) = σ(X ∪ Y) / N
Confidence: c(X → Y) = σ(X ∪ Y) / σ(X)
where σ(·) denotes the support count of an itemset and N is the total number of transactions.
Example: For the rule {Milk, Diaper} → {Beer},
s = σ({Milk, Diaper, Beer}) / N = 2/5 = 0.4
c = σ({Milk, Diaper, Beer}) / σ({Milk, Diaper}) = 2/3 ≈ 0.67
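As an illustration only (not part of the original text; the variable and function names are chosen
for this sketch), the following Python snippet computes the support and confidence of the rule
{Milk, Diaper} → {Beer} from the transactions in the table above:

    # Example market basket transactions from the table above
    transactions = [
        {"Bread", "Milk"},
        {"Bread", "Diaper", "Beer", "Eggs"},
        {"Milk", "Diaper", "Beer", "Coke"},
        {"Bread", "Milk", "Diaper", "Beer"},
        {"Bread", "Milk", "Diaper", "Coke"},
    ]

    def support_count(itemset, transactions):
        # sigma(X): number of transactions that contain every item of X
        return sum(1 for t in transactions if itemset <= t)

    X, Y = {"Milk", "Diaper"}, {"Beer"}
    N = len(transactions)
    s = support_count(X | Y, transactions) / N                                # 2/5 = 0.4
    c = support_count(X | Y, transactions) / support_count(X, transactions)   # 2/3 ≈ 0.67
    print(f"support = {s:.2f}, confidence = {c:.2f}")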
Given a set of transactions T, the goal of association rule mining is to find all rules
having support ≥ minsup and confidence ≥ minconf.
Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup
2. Rule Generation
– Generate high confidence rules from each frequent itemset, where each
rule is a binary partitioning of a frequent itemset
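A minimal sketch of the rule-generation step (illustrative only; the support_count dictionary is an
assumed input holding the support counts produced during frequent itemset generation): every
frequent itemset F is split into all non-empty antecedent/consequent pairs (X, F − X), and a rule
is kept only if its confidence meets the minconf threshold.

    from itertools import combinations

    def generate_rules(freq_itemset, support_count, minconf):
        # freq_itemset: a frozenset; support_count: dict mapping frozenset -> support count
        rules = []
        items = list(freq_itemset)
        for r in range(1, len(items)):                 # all non-empty proper antecedents X
            for X in map(frozenset, combinations(items, r)):
                conf = support_count[freq_itemset] / support_count[X]
                if conf >= minconf:
                    rules.append((set(X), set(freq_itemset - X), conf))
        return rules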
[Figure: Itemset lattice for I = {a, b, c, d, e}, enumerating every non-empty itemset from the five 1-itemsets up to {a, b, c, d, e}.]
A lattice structure can be used to enumerate the list of all possible itemsets. The figure above shows
an itemset lattice for I = {a, b, c, d, e}. In general, a data set that contains k items can potentially
generate up to 2^k − 1 frequent itemsets, excluding the null set. Because k can be very large in many
practical applications, the search space of itemsets that needs to be explored is exponentially
large.
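A tiny Python illustration (added here as an example, not from the original text) of the size of
this search space for k = 5 items:

    from itertools import combinations

    items = ["a", "b", "c", "d", "e"]
    lattice = [set(c) for r in range(1, len(items) + 1)
               for c in combinations(items, r)]
    print(len(lattice), 2 ** len(items) - 1)   # 31 31: 2^5 - 1 non-empty itemsets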
Brute-force approach: Every itemset in the lattice is a candidate frequent itemset, and the support
of each candidate is determined by matching it against every transaction in the database. If there
are N transactions, M candidate itemsets, and the maximum transaction width is w, this requires on
the order of N × M × w comparisons, which is prohibitively expensive because M grows exponentially
with the number of items.
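The following Python sketch (an illustrative example added here, not the text's own code) makes the
cost of the brute-force approach concrete: every candidate is counted with a full scan of the
database.

    from itertools import combinations

    def brute_force_frequent(transactions, minsup):
        # Enumerate every candidate itemset (M = 2^k - 1 of them) and count its
        # support by scanning all N transactions: roughly N * M * w comparisons.
        items = sorted(set().union(*transactions))
        frequent = {}
        for r in range(1, len(items) + 1):
            for cand in map(frozenset, combinations(items, r)):
                count = sum(1 for t in transactions if cand <= t)   # full database scan
                if count / len(transactions) >= minsup:
                    frequent[cand] = count
        return frequent

    # With the market basket table above and minsup = 0.6, itemsets such as
    # {Bread, Milk}, {Milk, Diaper} and {Diaper, Beer} are reported as frequent.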
2.3 The Apriori Principle: This section describes how the support measure helps to reduce
the number of candidate itemsets explored during frequent itemset generation. The use of
support for pruning candidate itemsets is guided by the following principle.
Apriori Principle: If an itemset is frequent, then all of its subsets must also be frequent.
To illustrate the idea behind the Apriori principle, consider the itemset lattice shown in the figure below.
Suppose {c, d, e} is a frequent itemset. Clearly, any transaction that contains {c, d, e} must
also contain its subsets {c, d}, {c, e}, {d, e}, {c}, {d}, and {e}. As a result, if {c, d, e} is
frequent, then all subsets of {c, d, e} (i.e., the shaded itemsets in the figure) must also be
frequent.
[Figure: Itemset lattice illustrating the Apriori principle; if {c, d, e} is frequent, then all of its subsets (shaded) must also be frequent.]
The Apriori Algorithm (Pseudo-Code)
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that
        are contained in t
    Lk+1 = candidates in Ck+1 with count ≥ min_support
end
return ∪k Lk;
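The pseudo-code above can be written as a small, self-contained Python sketch. This is an
illustrative implementation added for this unit, not the code of any particular library; the
candidate-generation step also applies the Apriori principle to prune candidates.

    from itertools import combinations

    def apriori(transactions, minsup_count):
        # L1: frequent 1-itemsets
        counts = {}
        for t in transactions:
            for item in t:
                s = frozenset([item])
                counts[s] = counts.get(s, 0) + 1
        Lk = {s for s, c in counts.items() if c >= minsup_count}
        frequent = set(Lk)
        k = 1
        while Lk:
            # Candidate generation: join Lk with itself to form (k+1)-itemsets and keep
            # only those whose k-subsets are all frequent (Apriori pruning).
            Ck1 = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
            Ck1 = {c for c in Ck1
                   if all(frozenset(s) in Lk for s in combinations(c, k))}
            # Support counting: one pass over the database.
            cand_counts = dict.fromkeys(Ck1, 0)
            for t in transactions:
                for c in Ck1:
                    if c <= t:
                        cand_counts[c] += 1
            Lk = {c for c, n in cand_counts.items() if n >= minsup_count}
            frequent |= Lk
            k += 1
        return frequent

    transactions = [
        {"Bread", "Milk"},
        {"Bread", "Diaper", "Beer", "Eggs"},
        {"Milk", "Diaper", "Beer", "Coke"},
        {"Bread", "Milk", "Diaper", "Beer"},
        {"Bread", "Milk", "Diaper", "Coke"},
    ]
    print(apriori(transactions, minsup_count=3))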
◼ The FP-Growth Approach (J. Han, J. Pei, and Y. Yin, SIGMOD'00)
◼ Depth-first search
◼ Major philosophy: Grow long patterns from short ones using local frequent items only
• F-list (the frequent items of the example database in frequency-descending order) = f-c-a-b-m-p
• Mining is partitioned by item: first the patterns containing p, then the patterns containing m but not p, and so on, down to the single pattern f
➢ To find the patterns containing p, accumulate all of the transformed prefix paths of item p to form p's conditional pattern base
[Figure: FP-tree constructed from the example database. The header table lists the frequent items with their counts (f:4, c:4, a:3, b:3, m:3, p:3) and node-links into the tree; the tree is rooted at {} and shares prefix paths such as f:4 - c:3 - a:3 - m:2, with a separate branch c:1 - b:1 - p:1.]
Conditional pattern bases obtained by following the node-links of each item:
item    conditional pattern base
c       f:3
a       fc:3
b       fca:1, f:1, c:1
m       fca:2, fcab:1
For example, m's conditional pattern base {fca:2, fcab:1} yields the m-conditional FP-tree {} - f:3 - c:3 - a:3. Recursing on it, the conditional pattern base of "am" is (fc:3), giving the am-conditional FP-tree {} - f:3 - c:3, and the conditional pattern base of "cm" is (f:3), giving the cm-conditional FP-tree {} - f:3.
Benefits of the FP-tree structure:
➢ Completeness: preserves the complete information needed for frequent pattern mining
➢ Compactness:
• Items in frequency descending order: the more frequently occurring, the more
likely to be shared
• Never larger than the original database (not counting node-links and the count
fields)
A special case: when an FP-tree consists of a single prefix path, all frequent patterns it represents can be generated directly as combinations of the items on that path.
Method: For each frequent item, construct its conditional pattern base and then its
conditional FP-tree. Repeat the process on each newly created conditional FP-tree
until the resulting FP-tree is empty or contains only one path; a single path
generates all the combinations of its sub-paths, each of which is a frequent pattern.
(A Python sketch of this recursion is given after the list below.)
➢ Divide-and-conquer:
• Decompose both the mining task and the database according to the frequent patterns
obtained so far
➢ Other factors:
• The basic operations are counting local frequent items and building sub FP-trees; no pattern
search and matching is required
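As a rough illustration of the method above (a simplified sketch with assumed names; real
implementations add node-links and further optimizations), the following Python code builds an
FP-tree from (itemset, count) pairs and mines it recursively through conditional pattern bases and
conditional FP-trees:

    from collections import defaultdict

    class Node:
        def __init__(self, item, parent):
            self.item, self.parent, self.count, self.children = item, parent, 0, {}

    def build_tree(pattern_base, minsup):
        # pattern_base: list of (items, count) pairs
        freq = defaultdict(int)
        for items, cnt in pattern_base:
            for it in items:
                freq[it] += cnt
        freq = {it: c for it, c in freq.items() if c >= minsup}   # local frequent items
        root, header = Node(None, None), defaultdict(list)        # header: item -> its nodes
        for items, cnt in pattern_base:
            # keep only locally frequent items, in frequency-descending (F-list) order
            path = sorted((it for it in items if it in freq),
                          key=lambda it: (-freq[it], it))
            node = root
            for it in path:
                if it not in node.children:
                    node.children[it] = Node(it, node)
                    header[it].append(node.children[it])
                node = node.children[it]
                node.count += cnt
        return root, header, freq

    def fp_growth(pattern_base, minsup, suffix=()):
        root, header, freq = build_tree(pattern_base, minsup)
        for item in sorted(freq, key=lambda it: freq[it]):        # least frequent first
            new_suffix = suffix + (item,)
            yield new_suffix, freq[item]                          # a frequent pattern
            # conditional pattern base: prefix path of every node holding `item`
            cond_base = []
            for node in header[item]:
                prefix, p = [], node.parent
                while p is not None and p.item is not None:
                    prefix.append(p.item)
                    p = p.parent
                if prefix:
                    cond_base.append((prefix, node.count))
            yield from fp_growth(cond_base, minsup, new_suffix)   # mine conditional FP-tree

    # Example usage with the market basket transactions and minimum support count 3:
    # list(fp_growth([(list(t), 1) for t in transactions], 3))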
The number of frequent itemsets produced from a transaction data set can be very large. It is
useful to identify a small representative set of itemsets from which all other frequent itemsets
can be derived. Two such representations are maximal and closed frequent itemsets.
2.6 Maximal Frequent Itemset
A frequent itemset is maximal if none of its immediate supersets is frequent.
[Figure: Itemset lattice showing the border between frequent and infrequent itemsets; the maximal frequent itemsets are the frequent itemsets lying immediately below this border.]
2.7 Closed Frequent Itemset
An itemset X is closed if none of its immediate supersets has the same support count as X; equivalently, X is not closed if at least one of its immediate supersets has exactly the same support count as X.
Example: consider the following transaction data set.
TID Items
1 ABC
2 ABCD
3 BCE
4 ACDE
5 DE
[Figure: Itemset lattice for this data set, annotated with the IDs of the transactions that contain each itemset; for example, itemset ABC appears in transactions 1 and 2, while itemsets such as ABCDE are not contained in any transaction.]
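Given all frequent itemsets together with their support counts (for example, the output of an
Apriori run augmented with counts; the dictionary name below is assumed for this sketch), the
maximal and closed frequent itemsets can be identified directly from the definitions above:

    def maximal_and_closed(frequent):
        # frequent: dict mapping frozenset -> support count, for ALL frequent itemsets
        maximal, closed = set(), set()
        for X, sup in frequent.items():
            # immediate supersets of X that are themselves frequent
            supersets = [Y for Y in frequent if X < Y and len(Y) == len(X) + 1]
            if not supersets:                                  # no frequent immediate superset
                maximal.add(X)
            if all(frequent[Y] != sup for Y in supersets):     # no superset with equal support
                closed.add(X)
        return maximal, closed

Every maximal frequent itemset is also closed, so the maximal itemsets form the most compact of
the two representations, while the closed itemsets additionally preserve the exact support counts
of all frequent itemsets.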