Data Mining and Data Warehousing
Unit 4: Association Analysis
Introduction
Many business enterprises accumulate large quantities of data from their day-to-day operations.
For example, huge amounts of customer purchase data are collected daily at the checkout counters
of grocery stores. Table 1 illustrates an example of such data, commonly known as market basket
transactions. Each row in this table corresponds to a transaction, which contains a unique identifier
labeled TID and a set of items bought by a given customer. Retailers are interested in analyzing
the data to learn about the purchasing behavior of their customers. Such valuable information can
be used to support a variety of business-related applications such as marketing promotions,
inventory management, and customer relationship management.
Table 1: An example of market basket transactions

TID   Items
1     {Bread, Milk}
2     {Bread, Diapers, Beer, Eggs}
3     {Milk, Diapers, Beer, Cola}
4     {Bread, Milk, Diapers, Beer}
5     {Bread, Milk, Diapers, Cola}
For example, the following rule can be extracted from the data set shown in Table 1:

{Diapers} → {Beer}

The rule suggests that a strong relationship exists between the sale of diapers and beer, because many customers who buy diapers also buy beer. Retailers can use this type of rule to identify new opportunities for cross-selling products to their customers.
Association rules take the form “If antecedent, then consequent,” along with a measure of the
support and confidence associated with the rule. Typically, association rules are considered
interesting if they satisfy both a minimum support threshold and a minimum confidence threshold.
Such thresholds can be set by users or domain experts. For example, a particular supermarket may
find that of the 1000 customers shopping on a Thursday night, 200 bought diapers, and of the 200
who bought diapers, 50 bought beer. Thus, the association rule would be: “If buy diapers, then buy
beer,” with a support of 50/1000 = 5% and a confidence of 50/200 = 25%.
If the items or attributes in an association rule reference only one dimension, then it is a single-dimensional association rule. For example:

buys(X, “diapers”) → buys(X, “beer”)

If a rule references two or more dimensions, such as the dimensions age, income, and buys, then it is a multidimensional association rule. The following rule is an example of a multidimensional rule:
age(X, “30...39”) ∧ income(X, “42K...48K”) → buys(X, “high resolution TV”)
Examples of tasks to which association analysis can be applied include:

1. Investigating the proportion of subscribers to your company’s cell phone plan that respond positively to an offer of a service upgrade
2. Examining the proportion of children whose parents read to them who are themselves good readers
3. Predicting degradation in telecommunications networks
4. Finding out which items in a supermarket are purchased together, and which items are never purchased together
5. Determining the proportion of cases in which a new drug will exhibit dangerous side effects
The basic ideas of association rule mining involve a number of new definitions:
Itemsets
Let I = {I1, I2, ..., Im} be a set of items. An itemset is a set of items contained in I, and a k-itemset is an itemset containing k items. In other words, an itemset is a collection of zero or more items. For example, {Bread, Milk} is a 2-itemset and {Bread, Milk, Cola} is a 3-itemset, each drawn from the items in Table 1.
Support count
The support count σ(l) of an itemset l is the number of transactions (market baskets) that contain all items in l.
Support
The support s of a particular association rule A → B is the proportion of transactions in D that contain both A and B. That is,

s(A → B) = σ(A ∪ B) / N

where N is the total number of transactions in D. The support of a rule determines how often the rule is applicable to a given data set.
Confidence
The confidence c of the association rule A → B is a measure of the accuracy of the rule, as determined by the percentage of transactions in D containing A that also contain B. In other words,

c(A → B) = σ(A ∪ B) / σ(A)

Confidence determines how frequently items in B appear in transactions that contain A.
Consider the rule {Milk, Diapers} → {Beer}. Since the support count for {Milk, Diapers, Beer} is 2 and the total number of transactions is 5, the rule's support is 2/5 = 0.4. The rule's confidence is obtained by dividing the support count for {Milk, Diapers, Beer}, i.e., the items in both the antecedent and the consequent (A ∪ B in the rule A → B), by the support count for {Milk, Diapers}, i.e., the items in the antecedent alone (A in the rule A → B). Since there are 3 transactions that contain both milk and diapers, the confidence for this rule is 2/3 ≈ 0.67.
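These calculations are easy to reproduce in code. The following minimal Python sketch (the function and variable names are illustrative, not from the text) recomputes the support and confidence of {Milk, Diapers} → {Beer} over the Table 1 transactions:

```python
def support_confidence(transactions, antecedent, consequent):
    """Support and confidence for the rule antecedent -> consequent."""
    A, B = set(antecedent), set(consequent)
    n_both = sum(1 for t in transactions if (A | B) <= set(t))  # sigma(A u B)
    n_ante = sum(1 for t in transactions if A <= set(t))        # sigma(A)
    return n_both / len(transactions), n_both / n_ante

# The five market basket transactions of Table 1
db = [{"Bread", "Milk"},
      {"Bread", "Diapers", "Beer", "Eggs"},
      {"Milk", "Diapers", "Beer", "Cola"},
      {"Bread", "Milk", "Diapers", "Beer"},
      {"Bread", "Milk", "Diapers", "Cola"}]

s, c = support_confidence(db, {"Milk", "Diapers"}, {"Beer"})
print(f"support = {s:.2f}, confidence = {c:.2f}")  # support = 0.40, confidence = 0.67
```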
A brute-force approach for mining association rules is to compute the support and confidence for
every possible rule. This approach is prohibitively expensive because there are exponentially many
rules that can be extracted from a data set. More specifically, the total number of possible rules
extracted from a data set that contains d items is
R = 3^d - 2^(d+1) + 1

Even for the small data set shown in Table 1, this approach requires us to compute the support and confidence for 3^6 - 2^7 + 1 = 602 rules. More than 80% of these rules are discarded after applying minsup = 20% and minconf = 50%, so most of the computation is wasted.
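A quick check of this count for the d = 6 items of Table 1, in Python:

```python
# Rule-count formula R = 3^d - 2^(d+1) + 1 for d items
d = 6
print(3**d - 2**(d + 1) + 1)  # 602
```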
A common strategy adopted by many association rule mining algorithms is to decompose the
problem into two major subtasks:
1. Frequent Itemset Generation, whose objective is to find all the itemsets that satisfy the minsup
threshold. These itemsets are called frequent itemsets.
2. Rule Generation, whose objective is to extract all the high-confidence rules from the frequent
itemsets found in the previous step. These rules are called strong rules.
Itemset Lattice
A lattice structure can be used to enumerate the list of all possible itemsets. Figure 1 shows an
itemset lattice for {a, b, c, d, e}. In general, a data set that contains k items can potentially generate
up to 2^k - 1 frequent itemsets, excluding the null set. Because k can be very large in many practical
applications, the search space of itemsets that need to be explored is exponentially large.
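For illustration, the non-empty itemsets of the lattice can be enumerated with a short Python sketch (the names are ours, for illustration only):

```python
from itertools import chain, combinations

def itemset_lattice(items):
    """Enumerate every non-empty itemset over `items` (2^k - 1 in total)."""
    items = sorted(items)
    return chain.from_iterable(
        combinations(items, r) for r in range(1, len(items) + 1))

# For the 5 items {a, b, c, d, e}: 2^5 - 1 = 31 itemsets
print(sum(1 for _ in itemset_lattice("abcde")))  # 31
```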
Apriori Principle
If an itemset is frequent, then all of its subsets must also be frequent.
Suppose {c, d, e} is a frequent itemset. Clearly, any transaction that contains {c, d, e} must also
contain its subsets, {c, d}, {c, e}, {d, e}, {c}, {d}, and {e}. As a result, if {c, d, e} is frequent, then
all subsets of {c, d, e} (i.e., the shaded itemsets in figure 1) must also be frequent.
Figure 1: An illustration of the Apriori principle. If {c, d, e} is frequent, then all subsets of this itemset are frequent.
Conversely, if an itemset such as {a, b} is infrequent, then all of its supersets must be infrequent
too. As illustrated in Figure 2, the entire subgraph containing the supersets of {a, b} can be pruned
immediately once {a, b} is found to be infrequent. This strategy of trimming the exponential search
space based on the support measure is known as support-based pruning. Such a pruning strategy
is made possible by a key property of the support measure, namely, that the support for an itemset
never exceeds the support for its subsets. This property is also known as the anti-monotone
property of the support measure.
Figure 2: An illustration of support-based pruning. If {a, b} is infrequent, then all supersets of {a, b} are also infrequent.
Apriori Algorithm
1. Let k=1
2. Generate frequent itemsets of length 1
3. Repeat until no new frequent itemsets are identified
3.1 Generate length (k+1) candidate itemsets from length k frequent itemsets
3.2 Prune candidate itemsets containing subsets of length k that are infrequent
3.3 Count the support of each candidate by scanning the DB
3.4 Eliminate (prune) candidates that are infrequent, leaving only those that are frequent.
For the transaction database D of the following table, we use the Apriori algorithm to compute the frequent itemsets. We assume a minimum support count (minsup count) of 2.
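The worked table from the original notes is not reproduced here, so as an illustration the following minimal Python sketch (all function and variable names are ours) runs the four steps of the algorithm on the Table 1 transactions with a minimum support count of 2:

```python
from itertools import combinations

def apriori(transactions, minsup_count):
    """Return every frequent itemset (as a frozenset) with its support count."""
    transactions = [frozenset(t) for t in transactions]

    # Step 2: generate frequent 1-itemsets.
    counts = {}
    for t in transactions:
        for item in t:
            counts[frozenset([item])] = counts.get(frozenset([item]), 0) + 1
    frequent = {c: n for c, n in counts.items() if n >= minsup_count}
    result = dict(frequent)

    k = 1
    # Step 3: repeat until no new frequent itemsets are identified.
    while frequent:
        # Step 3.1: join pairs of frequent k-itemsets into (k+1)-candidates.
        candidates = {a | b for a in frequent for b in frequent
                      if len(a | b) == k + 1}
        # Step 3.2: prune candidates that have an infrequent k-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent
                             for s in combinations(c, k))}
        # Step 3.3: count support with one scan of the database.
        counts = {c: sum(1 for t in transactions if c <= t)
                  for c in candidates}
        # Step 3.4: eliminate infrequent candidates.
        frequent = {c: n for c, n in counts.items() if n >= minsup_count}
        result.update(frequent)
        k += 1
    return result

# Table 1 transactions, minsup count = 2
db = [{"Bread", "Milk"},
      {"Bread", "Diapers", "Beer", "Eggs"},
      {"Milk", "Diapers", "Beer", "Cola"},
      {"Bread", "Milk", "Diapers", "Beer"},
      {"Bread", "Milk", "Diapers", "Cola"}]
for itemset, n in sorted(apriori(db, 2).items(), key=lambda kv: len(kv[0])):
    print(sorted(itemset), n)
```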
Generating Association Rules from Frequent Itemsets

Once the frequent itemsets have been found, strong association rules are generated from them as follows. For every nonempty proper subset s of a frequent itemset l, output the rule “s → (l - s)” if support count(l) / support count(s) ≥ min_conf, where min_conf is the minimum confidence threshold. Because the rules are generated from frequent itemsets, each one automatically satisfies the minimum support threshold.
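As a sketch, this procedure can be written as a small Python function; it reuses the db transactions and apriori() helper from the earlier sketch, and the names are again illustrative:

```python
from itertools import combinations

def generate_rules(l, support_counts, min_conf):
    """Every rule s -> (l - s) from frequent itemset l that meets min_conf.
    `support_counts` maps frozensets to support counts."""
    l = frozenset(l)
    rules = []
    for size in range(1, len(l)):           # every nonempty proper subset s
        for s in map(frozenset, combinations(sorted(l), size)):
            conf = support_counts[l] / support_counts[s]
            if conf >= min_conf:
                rules.append((set(s), set(l - s), conf))
    return rules

# Reuses db and apriori() from the previous sketch.
counts = apriori(db, 2)
for s, rest, conf in generate_rules({"Milk", "Diapers", "Beer"}, counts, 0.5):
    print(sorted(s), "->", sorted(rest), f"(confidence = {conf:.2f})")
```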
Example
Using the Apriori method we have computed many frequent itemsets. One of them is l = {A, B, E}. The nonempty proper subsets of l are {A}, {B}, {E}, {A, B}, {A, E}, and {B, E}. The association rules that can be generated are as follows:
So, if we suppose the minimum confidence to be 70%, only {E} → {A, B}, {A, E} → {B}, and {B, E} → {A} are strong association rules.
Remember that association rules can be derived from each of the 2-frequent itemsets, 3-frequent itemsets, and so on.
FP-Growth

The FP-growth algorithm takes a radically different approach to discovering frequent itemsets than the Apriori algorithm. It does not follow the generate-and-test paradigm of Apriori. Instead, it encodes the data set using a compact data structure called an FP-tree and extracts frequent itemsets directly from this structure.
Example
The first scan of the database is the same as Apriori, which derives the set of frequent items (1-
itemsets) and their support counts (frequencies).
The set of frequent items is sorted in the order of descending support count. This resulting set or
list is denoted L. Thus, we have L= {{B: 7}, {A: 6}, {C: 6}, {D: 2}, {E: 2}}.
An FP-tree is then constructed as follows. First, create the root of the tree, labeled with “null.”
Scan database D a second time. The items in each transaction are processed in L order (i.e., sorted
according to descending support count), and a branch is created for each transaction. For example,
the scan of the first transaction, “T1: A, B, E,” which contains three items (B, A, E in L order),
leads to the construction of the first branch of the tree with three nodes, 〈B: 1〉, 〈A: 1〉, and 〈E: 1〉,
where B is linked as a child of the root, A is linked to B, and E is linked to A. The second
transaction, T2, contains the items B and D in L order, which would result in a branch where B is
linked to the root and D is linked to B. However, this branch would share a common prefix, B,
with the existing path for T1. Therefore, we instead increase the count of the B node by 1, and
create a new node, 〈D: 1〉, which is linked as a child of 〈B: 2〉. In general, when considering the
branch to be added for a transaction, the count of each node along a common prefix is incremented
by 1, and nodes for the items following the prefix are created and linked accordingly.
To facilitate tree traversal, an item header table is built so that each item points to its occurrences
in the tree via a chain of node-links. The tree obtained after scanning all of the transactions is
shown in Figure 3, with the associated node-links. In this way, the problem of mining frequent
patterns in databases is transformed to that of mining the FP-tree.
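The construction just described can be sketched compactly in Python. This is a minimal illustration, not a full FP-growth implementation; the class and function names are ours:

```python
class FPNode:
    """An FP-tree node: item label, count, parent link, and child map."""
    def __init__(self, item, parent):
        self.item, self.count, self.parent = item, 1, parent
        self.children = {}

def build_fp_tree(transactions, minsup_count):
    # First scan: derive the frequent items and the ordered list L.
    counts = {}
    for t in transactions:
        for item in t:
            counts[item] = counts.get(item, 0) + 1
    L = [i for i, n in sorted(counts.items(), key=lambda kv: -kv[1])
         if n >= minsup_count]
    rank = {item: r for r, item in enumerate(L)}

    root = FPNode(None, None)                  # the "null" root
    header = {item: [] for item in L}          # item header table (node-links)
    # Second scan: insert each transaction in L order, sharing prefixes.
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in rank), key=rank.get):
            if item in node.children:          # common prefix: bump the count
                node.children[item].count += 1
            else:                              # new branch: create and link
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)     # extend the node-link chain
            node = node.children[item]
    return root, header
```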
FP-Tree mining
The FP-tree is mined as follows. Start from each frequent length-1 pattern (as an initial suffix
pattern), construct its conditional pattern base (a “sub database,” which consists of the set of prefix
paths in the FP-tree co-occurring with the suffix pattern), then construct its (conditional) FP-tree,
and perform mining recursively on such a tree. The pattern growth is achieved by the concatenation
of the suffix pattern with the frequent patterns generated from a conditional FP-tree.
We first consider E, which is the last item in L, rather than the first. E occurs in two branches of
the FP-tree of Figure 3. (The occurrences of E can easily be found by following its chain of node-
links.) The paths formed by these branches are 〈B, A, E: 1〉 and 〈B, A, C, E: 1〉. Therefore,
considering E as a suffix, its corresponding two prefix paths are 〈B, A: 1〉 and 〈B, A, C: 1〉, which
form its conditional pattern base. Its conditional FP-tree contains only a single path, 〈 B: 2, A: 2 〉;
C is not included because its support count of 1 is less than the minimum support count. The single
path generates all the combinations of frequent patterns: {B, E: 2}, {A, E: 2} and {B, A, E: 2}.
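Continuing the earlier sketch, the conditional pattern base of an item can be collected by walking each of its node-links back toward the root; this relies on the FPNode and header structures from the previous code block:

```python
def conditional_pattern_base(item, header):
    """Prefix paths co-occurring with `item`, found via its node-links."""
    base = []
    for node in header[item]:                  # each occurrence in the tree
        path, parent = [], node.parent
        while parent is not None and parent.item is not None:
            path.append(parent.item)
            parent = parent.parent
        if path:
            base.append((list(reversed(path)), node.count))
    return base

# For the tree described above, conditional_pattern_base("E", header)
# would yield [(["B", "A"], 1), (["B", "A", "C"], 1)].
```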
Similarly, we can compute all the frequent patterns:
Hence, the frequent patterns generated are: {B, E: 2}, {A, E: 2}, {B, A, E: 2}, {B, D: 2}, {B, C:
4}, {A, C: 4}, {B, A, C: 2} and {B, A: 4}
The FP-growth method transforms the problem of finding long frequent patterns to searching for
shorter ones recursively and then concatenating the suffix. It uses the least frequent items as a
suffix, offering good selectivity. The method substantially reduces the search costs.
Handling Categorical Attributes

Association analysis can also be applied to data sets with categorical and symmetric binary attributes, such as Internet survey data. An example of a rule that might be extracted from such data is:

{Shop Online = Yes} → {Privacy Concerns = Yes}

This rule suggests that most Internet users who shop online are concerned about their personal privacy.
To extract such patterns, the categorical and symmetric binary attributes are transformed into
"items" first, so that existing association rule mining algorithms can be applied. This type of
transformation can be performed by creating a new item for each distinct attribute-value pair. For
example, the nominal attribute Level of Education can be replaced by three binary items:
Education = College, Education = Graduate, and Education = High School. Similarly, symmetric
binary attributes such as Gender can be converted into a pair of binary items, Male and Female.
Table 4 shows the result of binarizing the Internet survey data.
Table 4: Internet survey data after binarizing categorical and symmetric binary attributes
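One simple way to perform this transformation in practice is pandas' get_dummies, which creates one binary item per distinct attribute-value pair. The survey columns below are hypothetical, for illustration only:

```python
import pandas as pd

# Hypothetical survey records; the column names are illustrative only.
survey = pd.DataFrame({
    "Gender": ["Female", "Male", "Male"],
    "Education": ["Graduate", "College", "High School"],
    "Shop Online": ["Yes", "Yes", "No"],
})
# One binary item per distinct attribute-value pair.
items = pd.get_dummies(survey)
print(list(items.columns))
# ['Gender_Female', 'Gender_Male', 'Education_College', ...]
```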
There are several issues to consider when applying association analysis to the binarized data:
1. Some attribute values may not be frequent enough to be part of a frequent pattern. This problem is more evident for nominal attributes that have many possible values, e.g., state names. Lowering the support threshold does not help, because doing so exponentially increases the number of frequent patterns found (many of which may be spurious) and makes the computation more expensive. A more practical solution is to group related attribute values into a small number of categories. For example, each state name can be replaced by its corresponding geographical region, such as Midwest, Pacific Northwest, Southwest, and East Coast. Another possibility is to aggregate the less frequent attribute values into a single category called “Others”.

2. Some attribute values may have considerably higher frequencies than others. For example, suppose 85% of the survey participants own a home computer. By creating a binary item for each attribute value that appears frequently in the data, we may potentially generate many redundant patterns.