Module 3 DM Notes For 2nd Internals
ASSOCIATION ANALYSIS
4.1 Introduction
• Many business enterprises accumulate large quantities of data from their day-to-day operations. For
example, huge amounts of customer purchase data are collected daily at the checkout counters of
grocery stores. Such data are commonly known as market basket transactions.
• Retailers are interested in analyzing the data to learn about the purchasing behavior of their
customers. Such valuable information can be used to support a variety of business-related
applications such as marketing promotions, inventory management, and customer relationship
management.
• Association analysis is useful for discovering interesting relationships hidden in large data sets. The
uncovered relationships can be represented in the form of association rules or sets of frequent items.
For example, the rule {Diapers} → {Beer} can be extracted from the data set shown in the
table below.
TID Items
1 Bread, Milk
2 Bread, Diapers, Beer, Eggs
3 Milk, Diapers, Beer, Coke
4 Bread, Milk, Diapers, Beer
5 Bread, Milk, Diapers, Coke
The rule suggests that a strong relationship exists between the sale of diapers and beer because
many customers who buy diapers also buy beer. Retailers can use this type of rule to help them
identify new opportunities for cross-selling their products to their customers.
There are two key Issues that need to be addressed when applying association analysis to market
basket data.
• First, discovering patterns from a large transaction data set can be computationally
expensive.
• Second, some of the discovered patterns are potentially spurious (fake) because they may
happen simply by chance.
An item can be treated as a binary variable whose value is one if the item is present in a transaction
and zero otherwise. Because the presence of an item in a transaction is often considered more
important than its absence, an item is an asymmetric binary variable.
Table 4.2. A binary 0/1 representation of the market basket data.
TID  Bread  Milk  Diapers  Beer  Eggs  Coke
1      1     1      0       0     0     0
2      1     0      1       1     1     0
3      0     1      1       1     0     1
4      1     1      1       1     0     0
5      1     1      1       0     0     1
This representation is perhaps a very simplistic view of real market basket data because it
ignores certain important aspects of the data such as the quantity of items sold or the price paid to
purchase them.
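As a concrete illustration, the following short sketch (plain Python, no external libraries; the item list and variable names are my own) builds the 0/1 representation above from the raw transactions:

    transactions = [
        {"Bread", "Milk"},
        {"Bread", "Diapers", "Beer", "Eggs"},
        {"Milk", "Diapers", "Beer", "Coke"},
        {"Bread", "Milk", "Diapers", "Beer"},
        {"Bread", "Milk", "Diapers", "Coke"},
    ]
    items = ["Bread", "Milk", "Diapers", "Beer", "Eggs", "Coke"]

    # Asymmetric binary variables: 1 if the item is present in the transaction, 0 otherwise.
    binary_matrix = [[1 if item in t else 0 for item in items] for t in transactions]

    for tid, row in enumerate(binary_matrix, start=1):
        print(tid, row)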
Itemset and Support Count Let I = {i1,i2,. . .,id} be the set of all items in a market basket data
and T = {t1, t2, . . . , tN } be the set of all transactions. Each transaction ti contains a subset of items
chosen from I. In association analysis, a collection of zero or more items is termed an itemset. If an
itemset contains k items, it is called a k-itemset. For instance, {Beer, Diapers, Milk} is an example
of a 3-itemset. The null (or empty) set is an itemset that does not contain any items.
The transaction width is defined as the number of items present in a transaction. A transaction
tj is said to contain an itemset X if X is a subset of tj. For example, the second transaction shown in
Table 4.2 contains the item-set {Bread, Diapers} but not {Bread, Milk}.
An important property of an itemset is its support count, which refers to the number of
transactions that contain a particular itemset. Mathematically, the support count, σ(X), for an itemset
X can be stated as follows:
σ(X) = |{ ti | X ⊆ ti, ti ∈ T }|,
where the symbol | · | denotes the number of elements in a set. In the data set shown in Table
4.2, the support count for {Beer, Diapers, Milk} is equal to two because there are only two
transactions that contain all three items.
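As a quick check of the definition, the sketch below (assuming the five example transactions above; the helper name support_count is my own) reproduces σ({Beer, Diapers, Milk}) = 2:

    def support_count(itemset, transactions):
        # sigma(X): number of transactions t that contain every item of X.
        return sum(1 for t in transactions if itemset.issubset(t))

    transactions = [
        {"Bread", "Milk"},
        {"Bread", "Diapers", "Beer", "Eggs"},
        {"Milk", "Diapers", "Beer", "Coke"},
        {"Bread", "Milk", "Diapers", "Beer"},
        {"Bread", "Milk", "Diapers", "Coke"},
    ]

    print(support_count({"Beer", "Diapers", "Milk"}, transactions))  # prints 2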
Association Rule An association rule is an implication expression of the form X → Y, where
X and Y are disjoint itemsets, i.e., X ∩ Y = ∅. The strength of an association rule can be
measured in terms of its support and confidence:
Support, s(X → Y) = σ(X ∪ Y) / N;    Confidence, c(X → Y) = σ(X ∪ Y) / σ(X),
where N is the total number of transactions. Support determines how often a rule is applicable to
a given data set, while confidence determines how frequently items in Y appear in transactions
that contain X.
Frequent Itemset
– An itemset whose support is greater than or equal to a minsup(minimum support) threshold
Formulation of Association Rule Mining Problem The association rule mining problem can be
formally stated as follows:
Definition 4.1 (Association Rule Discovery). Given a set of transactions T , find all the rules
having support ≥ minsup and confidence ≥ minconf, where minsup and minconf are the
corresponding support and confidence thresholds.
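The sketch below (helper names are my own) applies Definition 4.1 to a single candidate rule: it computes the support and confidence of {Diapers} → {Beer} on the example transactions and checks them against illustrative minsup and minconf thresholds:

    def support_count(itemset, transactions):
        return sum(1 for t in transactions if itemset.issubset(t))

    def rule_strength(X, Y, transactions):
        # Returns (support, confidence) of the rule X -> Y.
        sigma_xy = support_count(X | Y, transactions)
        return sigma_xy / len(transactions), sigma_xy / support_count(X, transactions)

    transactions = [
        {"Bread", "Milk"},
        {"Bread", "Diapers", "Beer", "Eggs"},
        {"Milk", "Diapers", "Beer", "Coke"},
        {"Bread", "Milk", "Diapers", "Beer"},
        {"Bread", "Milk", "Diapers", "Coke"},
    ]

    minsup, minconf = 0.4, 0.6                       # illustrative thresholds
    s, c = rule_strength({"Diapers"}, {"Beer"}, transactions)
    print(s, c, s >= minsup and c >= minconf)        # 0.6 0.75 True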
From the definition of support, notice that the support of a rule X → Y depends only on the support
of its corresponding itemset, X ∪ Y. For example, the following rules have identical support because they
involve items from the same itemset, {Beer, Diapers, Milk}:
{Beer, Diapers} →{Milk}, {Beer, Milk} →{Diapers}, {Diapers, Milk} →{Beer},
{Beer} → {Diapers, Milk}, {Milk} → {Beer, Diapers}, {Diapers} → {Beer, Milk}.
If the itemset is infrequent, then all six candidate rules can be pruned immediately without
having to compute their confidence values. Therefore, a common strategy adopted by many
association rule mining algorithms is to decompose the problem into two major subtasks:
1. Frequent Itemset Generation, whose objective is to find all the item-sets that
satisfy theminsup threshold. These itemsets are called frequent itemsets.
2. Rule Generation, whose objective is to extract all the high-confidence rules from the
frequent itemsets found in the previous step. These rules are called strong rules.
A brute-force approach for finding frequent itemsets is to determine the support count for every
candidate itemset in the lattice structure. To do this, we need to compare each candidate against
every transaction, an operation that is shown in Figure 4.2. If the candidate is contained in a
transaction, its support count is incremented. Because such an approach can be very expensive,
there are two ways to reduce the computational complexity of frequent itemset generation:
1. Reduce the number of candidate itemsets. The Apriori principle is an effective way to eliminate
some of the candidate itemsets without counting their support values.
2. Reduce the number of comparisons. Instead of matching each candidate itemset against every
transaction, we can reduce the number of comparisons by using more advanced data structures, either
to store the candidate itemsets or to compress the data set.
Apriori principle: If an itemset is frequent, then all of its subsets must also be frequent.
Figure 4.3. An illustration of the Apriori principle: if {c, d, e} is frequent, then all subsets of this
itemset are frequent.
Definition 4.2 (Monotonicity Property). Let I be a set of items, and J = 2^I be the power set of I.
A measure f is monotone (or upward closed) if
∀ X, Y ∈ J : (X ⊆ Y) → f(X) ≤ f(Y),
which means that if X is a subset of Y, then f(X) must not exceed f(Y). On the other hand, f
is anti-monotone (or downward closed) if
∀ X, Y ∈ J : (X ⊆ Y) → f(Y) ≤ f(X),
which means that if X is a subset of Y, then f(Y) must not exceed f(X). Support is anti-monotone,
and any measure with this property can be used to prune the exponential search space of candidate
itemsets.
Figure 4.4. An illustration of support-based pruning. If {a, b} is infrequent, then all supersets of
{a, b} are infrequent.
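Because support is anti-monotone, the support of a superset can never exceed the support of its subsets, which is exactly what licenses the pruning in Figure 4.4. The brute-force check below is a minimal sketch of that fact on the example transactions (for itemsets up to size 3); it is only a verification, not part of any mining algorithm:

    from itertools import combinations

    def support_count(itemset, transactions):
        return sum(1 for t in transactions if itemset.issubset(t))

    transactions = [
        {"Bread", "Milk"},
        {"Bread", "Diapers", "Beer", "Eggs"},
        {"Milk", "Diapers", "Beer", "Coke"},
        {"Bread", "Milk", "Diapers", "Beer"},
        {"Bread", "Milk", "Diapers", "Coke"},
    ]
    items = sorted({i for t in transactions for i in t})

    # For every pair X ⊂ Y of itemsets up to size 3, support must not increase with size.
    itemsets = [frozenset(c) for k in range(1, 4) for c in combinations(items, k)]
    for X in itemsets:
        for Y in itemsets:
            if X < Y:
                assert support_count(Y, transactions) <= support_count(X, transactions)
    print("support is anti-monotone on this data set")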
4.2.2 Frequent Itemset Generation in the Apriori Algorithm
Apriori is the first association rule mining algorithm that pioneered the use of support-based
pruning to systematically control the exponential growth of candidate itemsets. Figure 4.5 provides
a high-level illustration of the frequent itemset generation part of the algorithm for the example
transactions.
Figure 4.5. Illustration of frequent itemset generation using the Apriori algorithm.
We assume that the support threshold is 60%, which is equivalent to a minimum support count
equal to 3.
Initially, every item is considered as a candidate 1-itemset. After counting their supports, the
candidates {Coke} and {Eggs} are discarded because they appear in fewer than three transactions.
In the next iteration, candidate 2-itemsets are generated using only the frequent 1-itemsets,
because the Apriori principle ensures that all supersets of the infrequent 1-itemsets must be infrequent.
Because there are only four frequent 1-itemsets, the number of candidate 2-itemsets generated by the
algorithm is 6. Two of these six candidates, {Beer, Bread} and {Beer, Milk}, are subsequently found
to be infrequent after computing their support values. The remaining four candidates are frequent,
and thus will be used to generate candidate 3-itemsets. Without support-based pruning, there are
C(6, 3) = 20 candidate 3-itemsets that can be formed using the six items given in this example. With the
Apriori principle, we only need to keep candidate 3-itemsets whose subsets are frequent. The only
candidate that has this property is {Bread, Diapers, Milk}.
The effectiveness of the Apriori pruning strategy can be shown by counting the number of
candidate itemsets generated. A brute-force strategy of enumerating all itemsets (up to size 3) as
candidates will produce C(6, 1) + C(6, 2) + C(6, 3) = 6 + 15 + 20 = 41 candidates. With the Apriori
principle, this number decreases to C(6, 1) + C(4, 2) + 1 = 6 + 6 + 1 = 13 candidates, a reduction of
about 68%.
• To count the support of the candidates, the algorithm needs to make an additional pass over the data
set (steps 6–10). The subset function is used to determine all the candidate itemsets in Ck that are
contained in each transaction t.
• After counting their supports, the algorithm eliminates all candidate itemsets whose support counts
are less than minsup (step 12).
• The algorithm terminates when there are no new frequent itemsets generated, i.e.,
Fk = ∅ (step 13).
Two important characteristics of the Apriori algorithm are:
• It is a level-wise algorithm: it traverses the itemset lattice one level at a time, from frequent
1-itemsets up to the maximum size of frequent itemsets.
• It employs a generate-and-test strategy: at each iteration, new candidate itemsets are generated
from the frequent itemsets found in the previous iteration, and their supports are then counted
against the data set. A code sketch of this behaviour is given below.
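The level-wise, generate-and-test behaviour can be sketched compactly in Python. This is a simplified illustration under my own naming (candidates are formed by merging any two frequent (k − 1)-itemsets whose union has size k, then pruned with the Apriori principle), not the textbook pseudocode:

    from itertools import combinations

    def apriori_frequent_itemsets(transactions, minsup_count):
        # Level-wise (generate-and-test) frequent itemset generation.
        def support_count(itemset):
            return sum(1 for t in transactions if itemset.issubset(t))

        items = sorted({i for t in transactions for i in t})
        Fk = {frozenset([i]) for i in items
              if support_count(frozenset([i])) >= minsup_count}   # frequent 1-itemsets
        frequent, k = set(Fk), 2
        while Fk:
            # Candidate generation: merge pairs of frequent (k-1)-itemsets.
            candidates = {a | b for a in Fk for b in Fk if len(a | b) == k}
            # Candidate pruning: every (k-1)-subset of a candidate must be frequent.
            candidates = {c for c in candidates
                          if all(frozenset(s) in Fk for s in combinations(c, k - 1))}
            # Support counting: eliminate candidates below the minimum support count.
            Fk = {c for c in candidates if support_count(c) >= minsup_count}
            frequent |= Fk
            k += 1        # terminates when no new frequent itemsets are generated
        return frequent

    transactions = [
        {"Bread", "Milk"},
        {"Bread", "Diapers", "Beer", "Eggs"},
        {"Milk", "Diapers", "Beer", "Coke"},
        {"Bread", "Milk", "Diapers", "Beer"},
        {"Bread", "Milk", "Diapers", "Coke"},
    ]
    # minsup = 60% of 5 transactions, i.e., a minimum support count of 3.
    for itemset in sorted(apriori_frequent_itemsets(transactions, 3), key=len):
        print(set(itemset))

On the example data this reports the four frequent 1-itemsets and the four frequent 2-itemsets, and correctly finds no frequent 3-itemset, matching the discussion above.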
4.2.3 Candidate Generation and Pruning
• Candidate Generation: This operation generates new candidate k-itemsets based on the
frequent (k − 1)-itemsets found in the previous iteration.
• Candidate Pruning: This operation eliminates some of the candidate k-itemsets using
support-based pruning, i.e., a candidate is discarded if any of its (k − 1)-subsets is infrequent.
4.2.4 Support Counting
• To count the supports of the candidates, one approach is to enumerate the itemsets contained in
each transaction and use them to update the support counts of their respective candidate itemsets.
Consider, for example, a transaction t = {1, 2, 3, 5, 6}; it contains C(5, 3) = 10 itemsets of size 3.
• Some of these itemsets may correspond to the candidate 3-itemsets under investigation, in which
case their support counts are incremented.
• Other subsets of t that do not correspond to any candidates can be ignored.
• Figure 6.9 below shows a systematic way for enumerating the 3-itemsets contained in t.
• Finally, the prefix structures at Level 3 represent the complete set of 3- itemsets contained in t. For
example, the 3-itemsets that begin with prefix {1 2} are {1,2,3}, {1,2,5}, and {1,2,6}, while those
that begin with prefix {2 3} are {2,3,5} and {2,3,6}.
• The prefix structures shown in Figure 6.9 demonstrate how itemsets contained in a
transaction can be systematically enumerated, i.e., by specifying their items one by one, from the
leftmost item to the rightmost item. We still have to determine whether each enumerated
3-itemset corresponds to an existing candidate itemset. If it matches one of the candidates, then the
support count of the corresponding candidate is incremented.
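The enumeration-and-matching step just described can be sketched as follows; the candidate list here is hypothetical, and itertools.combinations stands in for the prefix-based enumeration of Figure 6.9 (for a sorted transaction it emits subsets in the same left-to-right order):

    from itertools import combinations

    # Hypothetical candidate 3-itemsets currently under investigation, with their counts.
    candidates = {frozenset(c): 0 for c in [(1, 2, 3), (1, 2, 5), (2, 3, 6), (3, 5, 6)]}

    t = (1, 2, 3, 5, 6)    # transaction, items listed in increasing order

    # Enumerate every 3-itemset contained in t and increment matching candidates.
    for subset in combinations(t, 3):
        key = frozenset(subset)
        if key in candidates:
            candidates[key] += 1     # matched an existing candidate
        # subsets that match no candidate are simply ignored

    print({tuple(sorted(c)): n for c, n in candidates.items()})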
In the Apriori algorithm, candidate itemsets are partitioned into different buckets and stored in a
hash tree. During support counting, itemsets contained in each transaction are also hashed into their
appropriate buckets. That way, instead of comparing each itemset in the transaction with every
candidate itemset, it is matched only against candidate itemsets that belong to the same bucket, as
shown in Figure 6.10.
Figure 6.11 shows an example of a hash tree structure. Each internal node of the tree uses the
following hash function, h(p) = p mod 3, to determine which branch of the current node should be
followed next. For example, items 1, 4, and 7 are hashed to the same branch (i.e., the leftmost
branch) because they have the same remainder after dividing the number by 3. All candidate
itemsets are stored at the leaf nodes of the hash tree. The hash tree shown in Figure 6.11 contains
15 candidate 3-itemsets, distributed across 9 leaf nodes.
Consider a transaction, t = {1, 2, 3, 5, 6}. To update the support counts of the candidate itemsets, the
hash tree must be traversed in such a way that all the leaf nodes containing candidate 3-itemsets
belonging to t must be visited at least once. Recall that the 3-itemsets contained in t must begin
with items 1, 2, or 3, as indicated by the Level 1 prefix structures shown in Figure 6.9. Therefore,
at the root node of the hash tree, the transaction is hashed separately on items 1, 2, and 3.
At the next level of the tree, the transaction is hashed on the second item listed in the Level 2
structures shown in Figure 6.9. For example, after hashing on item 1 at the root node, items 2, 3,
and 5 of the transaction are hashed. Items 2 and 5 are hashed to the middle child, while item 3 is
hashed to the right child, as shown in Figure 6.12. This process continues until the leaf nodes of the
hash tree are reached. The candidate itemsets stored at the visited leaf nodes are compared against
the transaction. If a candidate is a subset of the transaction, its support count is incremented. In this
example, 5 out of the 9 leaf nodes are visited and 9 out of the 15 itemsets are compared against the
transaction.
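A full hash tree is more elaborate than these notes need, but the idea of restricting comparisons to a single bucket can be sketched with a flat bucketing scheme that reuses the hash function h(p) = p mod 3 on the first item of each itemset. The candidate list below is illustrative, and this single-level bucketing is deliberately much coarser than the tree of Figure 6.11:

    from itertools import combinations
    from collections import defaultdict

    def h(p):
        return p % 3                      # hash function from the text

    # Illustrative candidate 3-itemsets (items written in increasing order).
    candidates = [(1, 4, 5), (1, 2, 4), (4, 5, 7), (1, 2, 5), (4, 5, 8),
                  (1, 5, 9), (1, 3, 6), (2, 3, 4), (5, 6, 7), (3, 4, 5),
                  (3, 5, 6), (3, 5, 7), (6, 8, 9), (3, 6, 7), (3, 6, 8)]

    # Bucket candidates by hashing their first item (a flat stand-in for Level 1 of the tree).
    buckets = defaultdict(list)
    for c in candidates:
        buckets[h(c[0])].append({"itemset": frozenset(c), "count": 0})

    # Each 3-subset of the transaction is compared only against candidates that
    # fall into the same bucket; counts are incremented on a match.
    t = (1, 2, 3, 5, 6)
    for subset in combinations(t, 3):
        for entry in buckets[h(subset[0])]:
            if entry["itemset"] == frozenset(subset):
                entry["count"] += 1

    for entries in buckets.values():
        for e in entries:
            if e["count"]:
                print(sorted(e["itemset"]), e["count"])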
The computational complexity of the Apriori algorithm can be affected by the following factors.
Support Threshold
Lowering the support threshold often results in more itemsets being declared as frequent. This has
an adverse effect on the computational complexity of the algorithm because more candidate
itemsets must be generated and counted, as shown in Figure 6.13. The maximum size of frequent
itemsets also tends to increase with lower support thresholds. As the maximum size of the frequent
itemsets increases, the algorithm will need to make more passes over the data set.
Number of Items (Dimensionality) As the number of items increases, more space will be needed
to store the support counts of items. If the number of frequent items also grows with the
dimensionality of the data, the computation and I/O costs will increase because of the larger
number of candidate itemsets generated by the algorithm.
Average Transaction Width For dense data sets, the average transaction width can be very large.
This affects the complexity of the Apriori algorithm in two ways.
• First, the maximum size of frequent itemsets tends to increase as the average transaction
width increases. As a result, more candidate itemsets must be examined during candidate
generation and support counting.
• Second, as the transaction width increases, more itemsets are contained in the transaction.
This will increase the number of hash tree traversals performed during support counting.
4.3 Rule Generation
This section discusses how association rules are extracted efficiently from a given frequent itemset.
Each frequent k-itemset, Y, can produce up to 2^k − 2 association rules, ignoring rules that have empty
antecedents or consequents (∅ → Y or Y → ∅). An association rule can be extracted by partitioning
the itemset Y into two non-empty subsets, X and Y − X, such that X → Y − X satisfies the confidence
threshold. Note that all such rules must have already met the support threshold because they are
generated from a frequent itemset.
Example 4.2. Let X = {1, 2, 3} be a frequent itemset. There are six candidate association rules
that can be generated from X: {1, 2} → {3}, {1, 3} → {2}, {2, 3} → {1}, {1} → {2, 3},
{2} → {1, 3}, and {3} → {1, 2}. As each of their supports is identical to the support of X, all six
rules automatically satisfy the support threshold and differ only in their confidence values.
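The partitioning step is easy to express directly. The sketch below (helper names are my own) generates every candidate rule X → Y − X from a given itemset Y and keeps those meeting an illustrative confidence threshold, using the grocery transactions for the support counts:

    from itertools import combinations

    def support_count(itemset, transactions):
        return sum(1 for t in transactions if itemset.issubset(t))

    def rules_from_itemset(Y, transactions, minconf):
        # All rules X -> Y - X, with X a non-empty proper subset of Y, meeting minconf.
        Y = frozenset(Y)
        sigma_Y = support_count(Y, transactions)
        rules = []
        for k in range(1, len(Y)):                     # 2^|Y| - 2 candidate partitions
            for X in map(frozenset, combinations(Y, k)):
                confidence = sigma_Y / support_count(X, transactions)
                if confidence >= minconf:
                    rules.append((set(X), set(Y - X), confidence))
        return rules

    transactions = [
        {"Bread", "Milk"},
        {"Bread", "Diapers", "Beer", "Eggs"},
        {"Milk", "Diapers", "Beer", "Coke"},
        {"Bread", "Milk", "Diapers", "Beer"},
        {"Bread", "Milk", "Diapers", "Coke"},
    ]

    # Y is treated as a given frequent itemset purely to illustrate the partitioning step.
    for X, Y_minus_X, conf in rules_from_itemset({"Beer", "Diapers", "Milk"},
                                                  transactions, minconf=0.6):
        print(X, "->", Y_minus_X, round(conf, 2))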