
Module III

ASSOCIATION ANALYSIS
4.1 Introduction
• Many business enterprises accumulate large quantities of data from their day-to-day operations. For example, huge amounts of customer purchase data are collected daily at the checkout counters of grocery stores. Such data are commonly known as market basket transactions.
• Retailers are interested in analyzing the data to learn about the purchasing behavior of their
customers. Such valuable information can be used to support a variety of business-related
applications such as marketing promotions, inventory management, and customer relationship
management.
• Association analysis is useful for discovering interesting relationships hidden in large data sets. The
uncovered relationships can be represented in the form of association rules or sets of frequent items.
For example, the rule {Diapers} → {Beer} can be extracted from the data set shown in the table
below.

TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke

The rule suggests that a strong relationship exists between the sale of diapers and beer because
many customers who buy diapers also buy beer. Retailers can use this type of rule to help them
identify new opportunities for cross-selling their products to customers.

There are two key issues that need to be addressed when applying association analysis to market
basket data.
• First, discovering patterns from a large transaction data set can be computationally
expensive.
• Second, some of the discovered patterns are potentially spurious (fake) because they may
happen simply by chance.
An item can be treated as a binary variable whose value is one if the item is present in a transaction
and zero otherwise. Because the presence of an item in a transaction is often considered more
important than its absence, an item is an asymmetric binary variable.



Table 4.2 A binary 0/1 representation of market basket data.

TID  Bread  Milk  Diapers  Beer  Eggs  Cola
1    1      1     0        0     0     0
2    1      0     1        1     1     0
3    0      1     1        1     0     1
4    1      1     1        1     0     0
5    1      1     1        0     0     1

This representation is perhaps a very simplistic view of real market basket data because it
ignores certain important aspects of the data such as the quantity of items sold or the price paid to
purchase them.
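As a rough illustration, the conversion from raw transactions to this 0/1 representation can be sketched in a few lines of Python. The transaction contents are taken from the tables above (using the item names of Table 4.2); the function name to_binary is only illustrative.

```python
# A minimal sketch (plain Python, no external libraries) of how market basket
# transactions can be turned into the 0/1 representation of Table 4.2.
transactions = {
    1: {"Bread", "Milk"},
    2: {"Bread", "Diapers", "Beer", "Eggs"},
    3: {"Milk", "Diapers", "Beer", "Cola"},
    4: {"Bread", "Milk", "Diapers", "Beer"},
    5: {"Bread", "Milk", "Diapers", "Cola"},
}

items = ["Bread", "Milk", "Diapers", "Beer", "Eggs", "Cola"]

def to_binary(transactions, items):
    """Map each transaction to a list of asymmetric binary variables (1 = item present)."""
    return {tid: [1 if item in basket else 0 for item in items]
            for tid, basket in transactions.items()}

for tid, row in to_binary(transactions, items).items():
    print(tid, row)
```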

Itemset and Support Count
Let I = {i1, i2, . . . , id} be the set of all items in a market basket data set
and T = {t1, t2, . . . , tN} be the set of all transactions. Each transaction ti contains a subset of items
chosen from I. In association analysis, a collection of zero or more items is termed an itemset. If an
itemset contains k items, it is called a k-itemset. For instance, {Beer, Diapers, Milk} is an example
of a 3-itemset. The null (or empty) set is an itemset that does not contain any items.

The transaction width is defined as the number of items present in a transaction. A transaction
tj is said to contain an itemset X if X is a subset of tj. For example, the second transaction shown in
Table 4.2 contains the itemset {Bread, Diapers} but not {Bread, Milk}.
An important property of an itemset is its support count, which refers to the number of
transactions that contain a particular itemset. Mathematically, the support count, σ(X), for an itemset
X can be stated as follows:

σ(X) = |{ ti | X ⊆ ti, ti ∈ T }|,

where the symbol | · | denotes the number of elements in a set. In the data set shown in Table
4.2, the support count for {Beer, Diapers, Milk} is equal to two because there are only two
transactions that contain all three items.
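A minimal sketch of the support count computation on the five example transactions is shown below; support_count is an illustrative name, not a function from any particular library.

```python
# Sketch of the support count sigma(X): the number of transactions that
# contain every item of the itemset X. Transactions as in the tables above.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]

def support_count(itemset, transactions):
    """sigma(X) = |{ t in T : X is a subset of t }|"""
    x = set(itemset)
    return sum(1 for t in transactions if x <= t)

print(support_count({"Beer", "Diapers", "Milk"}, transactions))   # 2
```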
Association Rule
An association rule is an implication expression of the form X → Y, where
X and Y are disjoint itemsets, i.e., X ∩ Y = ∅. The strength of an association rule can be
measured in terms of its support and confidence.

Prof. Sowmya S K & Akshatha Bhayyar Page 2


Support determines how often a rule is applicable to a given data set, while confidence determines
how frequently items in Y appear in transactions that contain X. The formal definitions of these
metrics are

Support, s(X → Y) = σ(X ∪ Y) / N
Confidence, c(X → Y) = σ(X ∪ Y) / σ(X)

where N is the total number of transactions.

Example: for the rule {Milk, Diapers} → {Beer}, σ({Milk, Diapers, Beer}) = 2 and σ({Milk, Diapers}) = 3, so

s = 2/5 = 0.4
c = 2/3 = 0.67
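The same numbers can be reproduced with a short sketch that applies the support and confidence formulas to the rule {Milk, Diapers} → {Beer}; the helper sigma below is illustrative.

```python
# Support and confidence of the rule {Milk, Diapers} -> {Beer}, computed from
# the same five transactions; a sketch only, matching s = 0.4 and c = 0.67 above.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]

def sigma(itemset):
    return sum(1 for t in transactions if set(itemset) <= t)

X, Y = {"Milk", "Diapers"}, {"Beer"}
support = sigma(X | Y) / len(transactions)      # s(X -> Y) = sigma(X u Y) / N
confidence = sigma(X | Y) / sigma(X)            # c(X -> Y) = sigma(X u Y) / sigma(X)
print(round(support, 2), round(confidence, 2))  # 0.4 0.67
```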

Frequent Itemset
– An itemset whose support is greater than or equal to a minsup (minimum support) threshold.

Formulation of Association Rule Mining Problem The association rule mining problem can be
formally stated as follows:
Definition 4.1 (Association Rule Discovery). Given a set of transactions T , find all the rules
having support ≥ minsup and confidence ≥ minconf, where minsup and minconf are the
corresponding support and confidence thresholds.
From the definition of support given above, notice that the support of a rule X → Y depends only on the support of its
corresponding itemset, X ∪ Y. For example, the following rules have identical support because they
involve items from the same itemset, {Beer, Diapers, Milk}:
{Beer, Diapers} → {Milk}, {Beer, Milk} → {Diapers}, {Diapers, Milk} → {Beer},
{Beer} → {Diapers, Milk}, {Milk} → {Beer, Diapers}, {Diapers} → {Beer, Milk}.

If the itemset is infrequent, then all six candidate rules can be pruned immediately without
having to compute their confidence values. Therefore, a common strategy adopted by many
association rule mining algorithms is to decompose the problem into two major subtasks:

1. Frequent Itemset Generation, whose objective is to find all the itemsets that
satisfy the minsup threshold. These itemsets are called frequent itemsets.

2. Rule Generation, whose objective is to extract all the high-confidence rules from the
frequent itemsets found in the previous step. These rules are called strong rules.



The computational requirements for frequent itemset generation are generally more expensive
than those of rule generation.

Figure 4.1. An itemset lattice.

4.2 Frequent Itemset Generation


A lattice structure can be used to enumerate the list of all possible itemsets. Figure 4.1 shows an
itemset lattice for I = {a, b, c, d, e}. In general, a data set that contains k items can potentially
generate up to 2^k − 1 frequent itemsets, excluding the null set. Because k can be very large in many
practical applications, the search space of itemsets that needs to be explored is exponentially large.

Figure 4.2. Counting the support of candidate itemsets.

A brute-force approach for finding frequent itemsets is to determine the support count for every
candidate itemset in the lattice structure. To do this, we need to compare each candidate against
every transaction, an operation that is shown in Figure 4.2. If the candidate is contained in a



transaction, its support count will be incremented. For example, the support for {Bread, Milk} is
incremented three times because the itemset is contained in transactions 1, 4, and 5. Such an
approach can be very expensive because it requires O(NMw) comparisons, where N is the number
of transactions, M = 2^k − 1 is the number of candidate itemsets, and w is the maximum transaction
width.
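A sketch of this brute-force approach on the five example transactions is given below: every one of the 2^6 − 1 = 63 candidate itemsets is checked against every transaction, which is exactly the O(NMw) behaviour described above.

```python
from itertools import combinations

# Brute-force support counting: enumerate every candidate itemset in the
# lattice (M = 2^k - 1 of them) and scan all N transactions for each one.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]
items = sorted(set().union(*transactions))

support = {}
for size in range(1, len(items) + 1):             # every itemset size 1..k
    for candidate in combinations(items, size):   # every candidate of that size
        support[candidate] = sum(1 for t in transactions if set(candidate) <= t)

print(len(support))                # 2^6 - 1 = 63 candidates counted
print(support[("Bread", "Milk")])  # 3 (contained in transactions 1, 4, and 5)
```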
There are several ways to reduce the computational complexity of frequent itemset generation.
1. Reduce the number of candidate itemsets (M). The Apriori principle, is an effective way to
eliminate some of the candidate itemsets without counting their support values.

2. Reduce the number of comparisons. Instead of matching each candidate itemset against every
transaction, we can reduce the number of comparisons by using more advanced data structures, either
to store the candidate itemsets or to compress the data set.

4.2.1 The Apriori Principle


This subsection describes how the support measure helps to reduce the number of candidate itemsets explored
during frequent itemset generation. The use of support for pruning candidate itemsets is guided by
the following principle.
Theorem 4.1 (Apriori Principle). If an itemset is frequent, then all of its subsets must also be
frequent.

To illustrate the idea behind the Apriori principle, consider the itemset lattice shown in
Figure 4.3. Suppose {c, d, e} is a frequent itemset. Clearly, any transaction that contains {c, d, e}
must also contain its subsets, {c, d}, {c, e}, {d, e}, {c}, {d}, and {e}. As a result, if {c, d, e} is
frequent, then all subsets of {c, d, e} (i.e., the shaded itemsets in this figure) must also be frequent.

Figure 4.3. An illustration of the Apriori principle. If {c, d, e} is frequent, then all subsets of this
itemset are frequent.



Conversely, if an itemset such as {a, b} is infrequent, then all of its supersets must be infrequent
too. As illustrated in Figure 4.4, the entire subgraph containing the supersets of {a, b} can be pruned
immediately once {a, b} is found to be infrequent. This strategy of trimming the exponential search
space based on the support measure is known as support-based pruning. Such a pruning strategy is
made possible by a key property of the support measure, namely, that the support for an itemset
never exceeds the support for its subsets. This property is also known as the anti-monotone property
of the support measure.

Definition 4.2 (Monotonicity Property). Let I be a set of items, and J = 2^I be the power set of I.
A measure f is monotone (or upward closed) if

∀X, Y ∈ J : (X ⊆ Y) → f(X) ≤ f(Y),

which means that if X is a subset of Y, then f(X) must not exceed f(Y). On the other hand, f
is anti-monotone (or downward closed) if

∀X, Y ∈ J : (X ⊆ Y) → f(Y) ≤ f(X),

which means that if X is a subset of Y, then f(Y) must not exceed f(X).

Figure 4.4. An illustration of support-based pruning. If {a, b} is infrequent, then all supersets of
{a, b} are infrequent.
4.2.2 Frequent Itemset Generation in the Apriori Algorithm
Apriori is the first association rule mining algorithm that pioneered the use of support-based
pruning to systematically control the exponential growth of candidate itemsets. Figure 4.5 provides



a high-level illustration of the frequent itemset generation part of the Apriori algorithm for the
transactions shown below.

Figure 4.5. Illustration of frequent itemset generation using the Apriori algorithm.

We assume that the support threshold is 60%, which is equivalent to a minimum support count
equal to 3.
Apriori principle ensures that all supersets of the infrequent 1-itemsets must be infrequent.
Because there are only four frequent 1-itemsets, the number of candidate 2-itemsets generated by the
algorithm is 6. Two of these six candidates, {Beer, Bread} and {Beer, Milk}, are subsequently found
to be infrequent after computing their support values. The remaining four candidates are frequent,
and thus will be used to generate candidate 3-itemsets. Without support-based pruning, there are
C(6, 3) = 20 candidate 3-itemsets that can be formed using the six items given in this example. With the
Apriori principle, we only need to keep candidate 3-itemsets whose subsets are frequent. The only
candidate that has this property is {Bread, Diapers, Milk}.
The effectiveness of the Apriori pruning strategy can be shown by counting the number of
candidate itemsets generated. A brute-force strategy of enumerating all itemsets (up to size 3) as
candidates will produce

C(6, 1) + C(6, 2) + C(6, 3) = 6 + 15 + 20 = 41

candidates. With the Apriori principle, this number decreases to

6 + 6 + 1 = 13


candidates, which represents a 68% reduction in the number of candidate itemsets even in this
simple example.
The pseudocode for the frequent itemset generation part of the Apriori algorithm is shown in
Algorithm 6.1. Let Ck denote the set of candidate k-itemsets and Fk denote the set of frequent
k-itemsets:
• The algorithm initially makes a single pass over the data set to determine the support of each item.
Upon completion of this step, the set of all frequent 1-itemsets, F1, will be known (steps 1 and 2).
• Next, the algorithm will iteratively generate new candidate k-itemsets using the frequent
(k - 1)-itemsets found in the previous iteration (step 5). Candidate generation is implemented
using a function called apriori-gen.

• To count the support of the candidates, the algorithm needs to make an additional pass over the data
set (steps 6–10). The subset function is used to determine all the candidate itemsets in Ck that are
contained in each transaction t.
• After counting their supports, the algorithm eliminates all candidate itemsets whose support counts
are less than minsup (step 12).
• The algorithm terminates when there are no new frequent itemsets generated, i.e.,
Fk = ∅ (step 13).
Two important characteristics of the algorithm (a minimal sketch of the resulting loop is given after this list) are:
• Level-wise algorithm
• Generate-and-test strategy
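A minimal sketch of this level-wise, generate-and-test loop is shown below. It is not the exact pseudocode of Algorithm 6.1: candidate generation here is a simplified merge of frequent (k − 1)-itemsets that share a common prefix, and the minimum support is passed in as an absolute count.

```python
from itertools import combinations

def apriori_frequent_itemsets(transactions, minsup_count):
    """Level-wise (generate-and-test) sketch of Apriori frequent itemset generation.
    Itemsets are kept as sorted tuples; minsup_count is an absolute support count."""
    # Pass 1 over the data: support of every item -> frequent 1-itemsets F1.
    support = {}
    for t in transactions:
        for item in t:
            support[(item,)] = support.get((item,), 0) + 1
    frequent = {c for c, s in support.items() if s >= minsup_count}
    all_frequent = {c: support[c] for c in frequent}

    k = 2
    while frequent:
        # Candidate generation: merge frequent (k-1)-itemsets that share their
        # first k-2 items (a simplified F_{k-1} x F_{k-1} style merge).
        prev = sorted(frequent)
        candidates = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                if prev[i][:-1] == prev[j][:-1]:
                    candidate = prev[i] + (prev[j][-1],)
                    # Candidate pruning: every (k-1)-subset must itself be frequent.
                    if all(sub in frequent for sub in combinations(candidate, k - 1)):
                        candidates.add(candidate)
        # Support counting: one additional pass over the data set per level.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if set(c) <= t:
                    counts[c] += 1
        frequent = {c for c, s in counts.items() if s >= minsup_count}
        all_frequent.update({c: counts[c] for c in frequent})
        k += 1
    return all_frequent

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]
for itemset, count in sorted(apriori_frequent_itemsets(transactions, minsup_count=3).items()):
    print(itemset, count)
```

Running it on the five example transactions with a minimum support count of 3 reproduces the behaviour described above: four frequent 1-itemsets, six candidate 2-itemsets of which four are frequent, and a single candidate 3-itemset, {Bread, Diapers, Milk}.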
4.2.3 Candidate Generation and Pruning
• Candidate Generation: This operation generates new candidate k-itemsets based on the
frequent (k − 1)-itemsets found in the previous iteration.



• Candidate Pruning: This operation eliminates some of the candidate k-itemsets using the
support-based pruning strategy.
There are many ways to generate candidate itemsets. An effective candidate generation procedure should meet the following requirements:
1. It should avoid generating too many unnecessary candidates.
2. It must ensure that the candidate set is complete, i.e., no frequent itemsets are left out.
3. It should not generate the same candidate itemset more than once.
Several candidate generation procedures exist; the simplest is the brute-force method, which treats every k-itemset as a potential candidate. A sketch contrasting it with candidate generation by merging frequent (k − 1)-itemsets is given below.
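The detailed procedures are not reproduced in these notes, but the contrast can be sketched roughly as follows: the brute-force method enumerates every k-itemset over all d items, whereas merging frequent (k − 1)-itemsets that share a common prefix (followed by subset-based pruning) produces far fewer candidates. The frequent 2-itemsets below are those of the running example; the rest of the setup is illustrative.

```python
from itertools import combinations

# Two ways of producing candidate 3-itemsets, sketched for comparison.
# Frequent 2-itemsets from the running example (support count >= 3):
F2 = [("Beer", "Diapers"), ("Bread", "Diapers"), ("Bread", "Milk"), ("Diapers", "Milk")]
items = ["Beer", "Bread", "Cola", "Diapers", "Eggs", "Milk"]

# 1. Brute-force method: every k-itemset over all d items is a candidate,
#    so C(d, k) candidates are generated regardless of what is frequent.
brute_force = list(combinations(items, 3))          # C(6, 3) = 20 candidates

# 2. Merging frequent (k-1)-itemsets that share their first k-2 items,
#    followed by subset-based candidate pruning.
merged = set()
f2 = set(F2)
for a, b in combinations(sorted(F2), 2):
    if a[:-1] == b[:-1]:                            # same (k-2)-item prefix
        cand = a + (b[-1],)
        if all(sub in f2 for sub in combinations(cand, 2)):
            merged.add(cand)

print(len(brute_force), sorted(merged))             # 20 [('Bread', 'Diapers', 'Milk')]
```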

4.2.4 Support Counting
• Support counting is the process of determining the frequency of occurrence for every candidate
itemset that survives the candidate pruning step of the apriori-gen function.
• One approach for doing this is to compare each transaction against every candidate itemset.
• This approach is computationally expensive, especially when the numbers of transactions and
candidate itemsets are large.
• An alternative approach is to enumerate the itemsets contained in each transaction and
use them to update the support counts of their respective candidate itemsets.
• To illustrate, consider a transaction t that contains five items, {1, 2, 3, 5, 6}. There are
C(5, 3) = 10 itemsets of size 3 contained in this transaction.

• Some of the itemsets may correspond to the candidate 3-itemsets under investigation, in which
case their support counts are incremented.
• Other subsets of t that do not correspond to any candidates can be ignored.
• Figure 6.9 below shows a systematic way for enumerating the 3-itemsets contained in t.



• Assuming that each itemset keeps its items in increasing lexicographic order, an itemset can be
enumerated by specifying the smallest item first, followed by the larger items.
• For instance, given t = {1,2,3,5,6}, all the 3-itemsets contained in t must begin with item
1, 2, or 3.
• It is not possible to construct a 3-itemset that begins with items 5 or 6 because there are only two
items in t whose labels are greater than or equal to 5.
• The number of ways to specify the first item of a 3-itemset contained in t is illustrated by the Level
1 prefix structures depicted in Figure 6.9. For instance, the structure with item 1 fixed in 1 2 3 5 6 represents a 3-itemset that begins
with item 1, followed by two more items chosen from the set {2, 3, 5, 6}.
• After fixing the first item, the prefix structures at Level 2 represent the number of ways to select
the second item. For example, the structure for the prefix {1 2} in 1 2 3 5 6 corresponds to itemsets that begin with items 1 and 2 and are
followed by item 3, 5, or 6.

• Finally, the prefix structures at Level 3 represent the complete set of 3- itemsets contained in t. For
example, the 3-itemsets that begin with prefix {1 2} are {1,2,3}, {1,2,5}, and {1,2,6}, while those
that begin with prefix {2 3} are {2,3,5} and {2,3,6}.
• The prefix structures shown in Figure 6.9 demonstrate how itemsets contained in a
transaction can be systematically enumerated, i.e., by specifying their items one byone, from the
leftmost item to the rightmost item. We still have to determine whether each enumerated 3-
itemset corresponds to an existing candidate itemset. If it matches one of the candidates, then the
support count of the corresponding candidate is incremented.
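A minimal sketch of this enumeration is shown below, using itertools.combinations to list the C(5, 3) = 10 subsets of t = {1, 2, 3, 5, 6} in lexicographic order and match them against a candidate set; the candidate 3-itemsets used here are made up for illustration.

```python
from itertools import combinations

# Enumerate the 3-itemsets contained in t = {1,2,3,5,6} in lexicographic order,
# then match them against a (hypothetical) set of candidate 3-itemsets.
t = [1, 2, 3, 5, 6]
candidates = {(1, 2, 3), (1, 5, 6), (2, 3, 5), (3, 5, 6), (4, 5, 6)}   # illustrative C3
support = {c: 0 for c in candidates}

for subset in combinations(t, 3):        # C(5, 3) = 10 subsets of size 3
    print(subset)                        # (1,2,3), (1,2,5), ..., (3,5,6)
    if subset in support:                # only enumerated subsets that are actual
        support[subset] += 1             # candidates get their count updated

print(support)
```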



Support Counting Using a Hash Tree

In the Apriori algorithm, candidate itemsets are partitioned into different buckets and stored in a
hash tree. During support counting, itemsets contained in each transaction are also hashed into their
appropriate buckets. That way, instead of comparing each itemset in the transaction with every
candidate itemset, it is matched only against candidate itemsets that belong to the same bucket, as
shown in Figure 6.10.

Figure 6.11 shows an example of a hash tree structure. Each internal node of the tree uses the
following hash function, h(p) = p mod 3, to determine which branch of the current node should be
followed next. For example, items 1, 4, and 7 are hashed to the same branch (i.e., the leftmost
branch) because they have the same remainder after dividing the number by 3. All candidate
itemsets are stored at the leaf nodes of the hash tree. The hash tree shown in Figure 6.11 contains
15 candidate 3-itemsets, distributed across 9 leaf nodes.

Consider a transaction, t = {1, 2, 3, 5, 6}. To update the support counts of the candidate itemsets, the
hash tree must be traversed in such a way that all the leaf nodes containing candidate 3-itemsets
belonging to t must be visited at least once. Recall that the 3-itemsets contained in t must begin
with items 1, 2, or 3, as indicated by the Level 1 prefix structures shown in Figure 6.9. Therefore,



at the root node of the hash tree, the items 1, 2, and 3 of the transaction are hashed separately. Item
1 is hashed to the left child of the root node, item 2 is hashed to the middle child, and item 3 is
hashed to the right child.

At the next level of the tree, the transaction is hashed on the second item listed in the Level 2
structures shown in Figure 6.9. For example, after hashing on item 1 at the root node, items 2, 3,
and 5 of the transaction are hashed. Items 2 and 5 are hashed to the middle child, while item 3 is
hashed to the right child, as shown in Figure 6.12. This process continues until the leaf nodes of the
hash tree are reached. The candidate itemsets stored at the visited leaf nodes are compared against
the transaction. If a candidate is a subset of the transaction, its support count is incremented. In this
example, 5 out of the 9 leaf nodes are visited and 9 out of the 15 itemsets are compared against the
transaction.
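The exact hash tree of Figures 6.10–6.12 is not reproduced here, but the mechanism can be sketched roughly as below: candidates are routed to leaves by hashing one item per level with h(p) = p mod 3, leaves are split when they overflow, and a transaction only descends into branches that its own items hash to. The leaf capacity and the candidate list are illustrative assumptions, not the values from the figures.

```python
from collections import defaultdict

MAX_LEAF = 3                      # illustrative leaf capacity (not given in the notes)

class Node:
    def __init__(self, depth=0):
        self.depth, self.leaf, self.children, self.itemsets = depth, True, {}, []

def h(item):                      # hash function used in the notes: h(p) = p mod 3
    return item % 3

def insert(node, itemset):
    if node.leaf:
        node.itemsets.append(itemset)
        # Split an overflowing leaf while there is still an item position to hash on.
        if len(node.itemsets) > MAX_LEAF and node.depth < len(itemset):
            node.leaf, pending, node.itemsets = False, node.itemsets, []
            for it in pending:
                insert(node, it)
        return
    key = h(itemset[node.depth])
    child = node.children.setdefault(key, Node(node.depth + 1))
    insert(child, itemset)

def matching_candidates(node, t, start, found):
    """Collect stored candidates contained in transaction t by following hashed branches."""
    if node.leaf:
        tset = set(t)
        found.update(c for c in node.itemsets if set(c) <= tset)
        return
    for i in range(start, len(t)):            # hash each remaining item of the transaction
        key = h(t[i])
        if key in node.children:
            matching_candidates(node.children[key], t, i + 1, found)

# Illustrative candidate 3-itemsets (not the 15 itemsets of Figure 6.11).
candidates = [(1, 2, 3), (1, 2, 5), (1, 5, 6), (2, 3, 5), (2, 3, 6),
              (3, 5, 6), (4, 5, 7), (1, 4, 7), (2, 5, 8), (3, 6, 9)]
root = Node()
for c in candidates:
    insert(root, c)

counts = defaultdict(int)
for t in [[1, 2, 3, 5, 6]]:                   # the transaction used in the notes
    found = set()
    matching_candidates(root, t, 0, found)
    for c in found:                           # each matched candidate counted once per transaction
        counts[c] += 1
print(sorted(counts.items()))
```

Collecting matches into a set before incrementing the counts is a small safeguard: two transaction items can hash to the same branch, so the same leaf may be visited more than once, and the set prevents a candidate from being counted twice for one transaction.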

4.2.5 Computational Complexity

The computational complexity of the Apriori algorithm can be affected by the following factors.

Support Threshold
Lowering the support threshold often results in more itemsets being declared as frequent. This has
an adverse effect on the computational complexity of the algorithm because more candidate
itemsets must be generated and counted, as shown in Figure 6.13. The maximum size of frequent
itemsets also tends to increase with lower support thresholds. As the maximum size of the frequent
itemsets increases, the algorithm will need to make more passes over the data set.

Number of Items (Dimensionality) As the number of items increases, more space will be needed
to store the support counts of items. If the number of frequent items also grows with the
dimensionality of the data, the computation and I/O costs will increase because of the larger
number of candidate itemsets generated by the algorithm.



Number of Transactions Since the Apriori algorithm makes repeated passes over the data set, its
run time increases with a larger number of transactions.

Average Transaction Width For dense data sets, the average transaction width can be very large.
This affects the complexity of the Apriori algorithm in two ways.
• First, the maximum size of frequent itemsets tends to increase as the average transaction
width increases. As a result, more candidate itemsets must be examined during candidate
generation and support counting.
• Second, as the transaction width increases, more itemsets are contained in the transaction.
This will increase the number of hash tree traversals performed during support counting.

Time complexity of the Apriori algorithm


1. Generation of frequent 1-itemsets
– We need to update the support count for every item present in each transaction.
– If w is the average transaction width, this operation requires O(Nw) time, where N is the total
number of transactions.
2. Candidate generation
– In the worst-case scenario, the algorithm must merge every pair of frequent (k − 1)-itemsets
found in the previous iteration, so the cost of candidate generation depends on the number of
frequent (k − 1)-itemsets and the number of candidate k-itemsets they produce.

3. Support counting

4.3 Rule Generation
Extraction of association rules efficiently from a given frequent itemset is discussed here. Each
frequent k-itemset, Y, can produce up to 2^k − 2 association rules, ignoring rules that have empty
antecedents or consequents (∅ → Y or Y → ∅). An association rule can be extracted by partitioning
the itemset Y into two non-empty subsets, X and Y − X, such that X → Y − X satisfies the confidence
threshold. Note that all such rules must have already met the support threshold because they are
generated from a frequent itemset.
Example 4.2. Let X = {1, 2, 3} be a frequent itemset. There are six candidate association rules



that can be generated from X: {1, 2} → {3}, {1, 3} → {2}, {2, 3} → {1}, {1} → {2, 3}, {2} → {1, 3},
and {3} → {1, 2}. Because the support of each rule is identical to the support of X, the rules must satisfy the
support threshold.
Computing the confidence of an association rule does not require additional scans of the
transaction data set. Consider the rule {1, 2} →{3}, which is generated from the frequent itemset X
= {1, 2, 3}. The confidence for this rule is σ({1, 2, 3})/σ({1, 2}). Because {1, 2, 3} is frequent, the
anti-monotone property of support ensures that {1, 2} must be frequent, too. Since the support
counts for both itemsets were already found during frequent itemset generation, there is no need to
read the entire data set again.
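A short sketch of this rule generation step is given below, applied to the itemset {Beer, Diapers, Milk} used earlier in these notes. The minconf value is an illustrative assumption, and the support counts are recomputed from the five example transactions only so that the sketch is self-contained; in the algorithm they would already be stored from frequent itemset generation.

```python
from itertools import combinations

# Rule generation sketch: partition an itemset Y into a non-empty antecedent X and
# consequent Y - X, and keep rules whose confidence meets minconf. Confidence uses
# only support counts of itemsets, so no extra scan of the data set is needed when
# those counts are already available.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]

def sigma(itemset):
    return sum(1 for t in transactions if set(itemset) <= t)

def rules_from_itemset(Y, minconf):
    Y = tuple(sorted(Y))
    rules = []
    for r in range(1, len(Y)):                    # antecedent sizes 1 .. k-1
        for X in combinations(Y, r):              # 2^k - 2 candidate rules in total
            conf = sigma(Y) / sigma(X)            # c(X -> Y-X) = sigma(Y) / sigma(X)
            if conf >= minconf:
                consequent = tuple(i for i in Y if i not in X)
                rules.append((X, consequent, round(conf, 2)))
    return rules

for rule in rules_from_itemset({"Beer", "Diapers", "Milk"}, minconf=0.6):
    print(rule)
```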

4.3.1 Confidence-Based Pruning

