Chapter-6 (Association Analysis Basic Concepts and Algorithms)
Example:
Consider the rule {Milk, Diapers} → {Beer}.
Since the support count for {Milk, Diapers, Beer} is 2 and the total number of transactions is 5, the rule’s
support is 2/5 = 0.4.
The rule’s confidence is obtained by dividing the support count for {Milk, Diapers, Beer} by the support
count for {Milk, Diapers}. Since there are 3 transactions that contain milk and diapers, the confidence for
this rule is 2/3 = 0.67.
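These numbers can be reproduced with a short Python sketch. The five transactions below are an assumed version of the standard market-basket table used in this chapter (the table itself is not reproduced in this section); they are chosen so that the counts match the figures quoted above.

    # Assumed transaction table: 5 transactions, {Milk, Diapers, Beer} appears
    # twice and {Milk, Diapers} three times, matching the numbers above.
    transactions = [
        {"Bread", "Milk"},
        {"Bread", "Diapers", "Beer", "Eggs"},
        {"Milk", "Diapers", "Beer", "Cola"},
        {"Bread", "Milk", "Diapers", "Beer"},
        {"Bread", "Milk", "Diapers", "Cola"},
    ]

    def support_count(itemset, transactions):
        """Number of transactions that contain every item in `itemset`."""
        return sum(1 for t in transactions if itemset <= t)

    antecedent = {"Milk", "Diapers"}
    rule_items = antecedent | {"Beer"}

    support = support_count(rule_items, transactions) / len(transactions)
    confidence = support_count(rule_items, transactions) / support_count(antecedent, transactions)
    print(support, round(confidence, 2))   # 0.4 0.67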
Problem Definition
Why Use Support and Confidence?
Support is an important measure because a rule that has very low support may occur simply by chance. A
low support rule is also likely to be uninteresting from a business perspective because it may not be
profitable to promote items that customers seldom buy together. For these reasons, support is often used
to eliminate uninteresting rules. It also has a desirable property that can be exploited for the efficient
discovery of association rules.
Confidence, on the other hand, measures the reliability of the inference made by a rule. For a given rule X
→ Y , the higher the confidence, the more likely it is for Y to be present in transactions that contain X.
Confidence also provides an estimate of the conditional probability of Y given X.
Association analysis results should be interpreted with caution. The inference made by an association rule
does not necessarily imply causality. Instead, it suggests a strong co-occurrence relationship between
items in the antecedent and consequent of the rule. Causality, on the other hand, requires knowledge
about the causal and effect attributes in the data and typically involves relationships occurring over time
(e.g., ozone depletion leads to global warming).
Problem Definition
There are several ways to reduce the computational complexity of frequent itemset generation.
Reduce the number of candidate itemsets (M). The Apriori principle is an effective way to eliminate
some of the candidate itemsets without counting their support values.
Reduce the number of comparisons. Instead of matching each candidate itemset against every
transaction, we can reduce the number of comparisons by using more advanced data structures, either
to store the candidate itemsets or to compress the data set.
Frequent Itemset Generation
Monotonicity Property
Let I be a set of items, and J = 2^I be the power set of I. A measure f is monotone (or upward closed) if
∀ X, Y ∈ J : (X ⊆ Y) → f(X) ≤ f(Y),
which means that if X is a subset of Y, then f(X) must not exceed f(Y). Conversely, f is anti-monotone (or
downward closed) if
∀ X, Y ∈ J : (X ⊆ Y) → f(Y) ≤ f(X),
which means that if X is a subset of Y, then f(Y) must not exceed f(X). Support is anti-monotone because every
transaction that contains Y necessarily contains every subset X of Y, so s(Y) ≤ s(X).
Any measure that possesses an anti-monotone property can be incorporated directly into the mining
algorithm to effectively prune the exponential search space of candidate itemsets.
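Because the Apriori principle rests on the anti-monotonicity of support, a small sketch (reusing the same assumed five-transaction table as in the earlier example) can verify the property directly: the support of an itemset never exceeds the support of any of its subsets.

    from itertools import combinations

    transactions = [
        {"Bread", "Milk"},
        {"Bread", "Diapers", "Beer", "Eggs"},
        {"Milk", "Diapers", "Beer", "Cola"},
        {"Bread", "Milk", "Diapers", "Beer"},
        {"Bread", "Milk", "Diapers", "Cola"},
    ]

    def support_count(itemset):
        return sum(1 for t in transactions if set(itemset) <= t)

    # Support is anti-monotone: for every itemset Y and every subset X of Y,
    # s(Y) <= s(X), because each transaction containing Y also contains X.
    Y = {"Milk", "Diapers", "Beer"}
    for k in range(1, len(Y)):
        for X in combinations(sorted(Y), k):
            assert support_count(Y) <= support_count(X)
    print("anti-monotonicity holds for all subsets of", Y)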
Frequent Itemset Generation in the Apriori Algorithm
Apriori is the first association rule mining algorithm that pioneered the use of support-based pruning to
systematically control the exponential growth of candidate itemsets. Here, we assume that the support
threshold is 60%, which is equivalent to a minimum support count = 3 because there are 5 transactions.
Frequent Itemset Generation in the Apriori Algorithm
Initially, every item is considered as a candidate 1-itemset. After counting their supports, the candidate itemsets
{Cola} and {Eggs} are discarded because they appear in fewer than three transactions.
In the next iteration, candidate 2-itemsets are generated using only the frequent 1-itemsets because the Apriori
principle ensures that all supersets of the infrequent 1-itemsets must be infrequent. Because there are only four
frequent 1-itemsets, the number of candidate 2-itemsets generated by the algorithm is C(4, 2) = 6.
Two of these six candidates, {Beer, Bread} and {Beer, Milk}, are subsequently found to be infrequent after
computing their support values.
The remaining four candidates are frequent, and thus will be used to generate candidate 3-itemsets. Without
support-based pruning, there are C(6, 3) = 20 candidate 3-itemsets that can be formed using the six items given in
this example. With the Apriori principle, we only need to keep candidate 3-itemsets whose subsets are frequent.
The only candidate that has this property is {Bread, Diapers, Milk}.
The effectiveness of the Apriori pruning strategy can be shown by counting the number of candidate itemsets
generated. A brute-force strategy of enumerating all itemsets (up to size 3) as candidates will produce
C(6, 1) + C(6, 2) + C(6, 3) = 6 + 15 + 20 = 41 candidates.
With the Apriori principle, this number decreases to C(6, 1) + C(4, 2) + 1 = 6 + 6 + 1 = 13 candidates, which
represents a 68% reduction in the number of candidate itemsets even in this simple example.
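The candidate counts above can be reproduced with a minimal level-wise sketch. The transaction table is again the assumed five-transaction example; candidates are generated by merging frequent itemsets that share a common prefix and pruning those with an infrequent subset, so the total number of candidates generated can be compared with the brute-force figure of 41.

    from itertools import combinations

    transactions = [
        {"Bread", "Milk"},
        {"Bread", "Diapers", "Beer", "Eggs"},
        {"Milk", "Diapers", "Beer", "Cola"},
        {"Bread", "Milk", "Diapers", "Beer"},
        {"Bread", "Milk", "Diapers", "Cola"},
    ]
    minsup_count = 3
    items = sorted(set().union(*transactions))

    def support_count(itemset):
        return sum(1 for t in transactions if set(itemset) <= t)

    total_candidates = 0
    # Level 1: every item is a candidate 1-itemset.
    candidates = [(i,) for i in items]
    k = 1
    while candidates:
        total_candidates += len(candidates)
        frequent = [c for c in candidates if support_count(c) >= minsup_count]
        print(f"frequent {k}-itemsets:", frequent)
        # Generate candidate (k+1)-itemsets from pairs of frequent k-itemsets
        # sharing their first k-1 items, then prune those with an infrequent subset.
        freq_set = set(frequent)
        candidates = []
        for a, b in combinations(frequent, 2):
            if a[:-1] == b[:-1]:
                cand = tuple(sorted(set(a) | set(b)))
                if all(sub in freq_set for sub in combinations(cand, k)):
                    candidates.append(cand)
        k += 1

    print("total candidates generated:", total_candidates)   # 13 with pruning vs. 41 brute force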
Frequent Itemset Generation in the Apriori Algorithm
The frequent itemset generation part of the Apriori algorithm has two important characteristics.
It is a level-wise algorithm; i.e., it traverses the itemset lattice one level at a time, from frequent 1-
itemsets to the maximum size of frequent itemsets.
It employs a generate-and-test strategy for finding frequent itemsets. At each iteration, new candidate
itemsets are generated from the frequent itemsets found in the previous iteration. The support for each
candidate is then counted and tested against the minsup threshold. The total number of iterations
needed by the algorithm is kmax+1, where kmax is the maximum size of the frequent itemsets.
Frequent Itemset Generation in the Apriori Algorithm
Candidate Generation and Pruning
The apriori-gen() function (step 5 in the Apriori algorithm) generates candidate itemsets by performing the
following two operations.
1. Candidate Generation: This operation generates new candidate k-itemsets based on the frequent (k −
1)-itemsets found in the previous iteration.
2. Candidate Pruning: This operation eliminates some of the candidate k-itemsets using the support-
based pruning strategy.
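The two operations can be isolated in a short sketch. The merge step assumes the common F(k−1) × F(k−1) strategy (two frequent (k−1)-itemsets are merged if they share their first k−2 items), which the text above does not prescribe; the pruning step applies the Apriori principle to the (k−1)-subsets of each candidate.

    from itertools import combinations

    def apriori_gen(frequent_k_minus_1, k):
        """Generate candidate k-itemsets from frequent (k-1)-itemsets.

        Assumed F_{k-1} x F_{k-1} strategy: merge two frequent (k-1)-itemsets
        that share their first k-2 items, then prune any candidate that has
        an infrequent (k-1)-subset.
        """
        frequent = set(frequent_k_minus_1)
        candidates = set()
        # 1. Candidate generation
        for a, b in combinations(sorted(frequent_k_minus_1), 2):
            if a[:k - 2] == b[:k - 2]:
                candidates.add(tuple(sorted(set(a) | set(b))))
        # 2. Candidate pruning (support-based, via the Apriori principle)
        return [c for c in candidates
                if all(sub in frequent for sub in combinations(c, k - 1))]

    # Example: the four frequent 2-itemsets from the running example.
    F2 = [("Beer", "Diapers"), ("Bread", "Diapers"), ("Bread", "Milk"), ("Diapers", "Milk")]
    print(apriori_gen(F2, 3))   # [('Bread', 'Diapers', 'Milk')]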
Support Counting
Support counting (implemented in steps 6 - 11 in the algorithm) is the process of determining the
frequency of occurrence for every candidate itemset that survives the candidate pruning step of the
apriori-gen().
One approach for doing this is to compare each transaction against every candidate itemset and to update
the support counts of candidates contained in the transaction. This approach is computationally expensive,
especially when the numbers of transactions and candidate itemsets are large.
An alternative approach is to enumerate the itemsets contained in each transaction and use them to
update the support counts of their respective candidate itemsets. To illustrate, consider a transaction t that
contains five items, {1, 2, 3, 5, 6}. There are C(5, 3) = 10 itemsets of size 3 contained in this transaction.
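The enumeration itself is straightforward with itertools; a minimal sketch:

    from itertools import combinations

    t = [1, 2, 3, 5, 6]                      # transaction with five items
    subsets_of_size_3 = list(combinations(t, 3))
    print(len(subsets_of_size_3))            # 10 = C(5, 3)
    print(subsets_of_size_3[:3])             # (1, 2, 3), (1, 2, 5), (1, 2, 6), ...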
Frequent Itemset Generation in the Apriori Algorithm
Support Counting
The adjacent figure shows a systematic way for enumerating the 3-itemsets contained in t.
Assuming that each itemset keeps its items in increasing lexicographic order, an itemset can be enumerated by
specifying its smallest item first, followed by the larger items.
The number of ways to specify the first item of a 3-itemset contained in t is illustrated by the Level 1 prefix
structures.
After fixing the first item, the prefix structures at Level 2
represent the number of ways to select the second item.
Finally, the prefix structures at Level 3 represent the complete
set of 3-itemsets contained in t.
The prefix structures demonstrate how itemsets contained in
a transaction can be systematically enumerated, i.e., by
specifying their items one by one, from the leftmost item to
the rightmost item.
However, we still have to determine whether each
enumerated 3-itemset corresponds to an existing candidate
itemset. If it matches one of the candidates, then the support
count of the corresponding candidate is incremented.
Frequent Itemset Generation in the Apriori Algorithm
Support Counting Using a Hash Tree
The candidate itemsets are partitioned into different buckets and stored in a hash tree. During support
counting, itemsets contained in each transaction are also hashed into their appropriate buckets. This way,
instead of comparing each itemset in the transaction with every candidate itemset, it is matched only
against candidate itemsets that belong to the same bucket, as shown in the figure below.
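A full hash tree is not reproduced here, but the bucketing idea can be sketched with an ordinary dictionary: candidates are hashed into buckets once, and each 3-subset of a transaction is compared only against the candidates in its own bucket. The hash function below (sum of item ids modulo the number of buckets) is an assumption made for illustration, not the level-by-level tree-based scheme used by actual implementations.

    from collections import defaultdict
    from itertools import combinations

    NUM_BUCKETS = 7
    candidates = [(1, 2, 3), (1, 2, 5), (1, 5, 6), (2, 3, 5), (3, 5, 6), (4, 5, 8)]

    def bucket(itemset):
        # Assumed hash function: sum of item ids modulo the number of buckets.
        return sum(itemset) % NUM_BUCKETS

    # Partition the candidate 3-itemsets into buckets once, up front.
    buckets = defaultdict(list)
    for c in candidates:
        buckets[bucket(c)].append(c)

    support = defaultdict(int)
    transaction = (1, 2, 3, 5, 6)
    # Each 3-subset of the transaction is matched only against candidates
    # that fall into the same bucket, not against every candidate.
    for subset in combinations(transaction, 3):
        for c in buckets[bucket(subset)]:
            if c == subset:
                support[c] += 1

    print(dict(support))   # counts for the candidates contained in the transaction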
Frequent Itemset Generation in the Apriori Algorithm
The computational complexity of the Apriori algorithm can be affected by
the following factors.
Support Threshold: Lowering the support threshold often results in more
itemsets being declared as frequent. This has an adverse effect on the
computational complexity of the algorithm because more candidate
itemsets must be generated and counted (refer figure (a)). The maximum
size of frequent itemsets also tends to increase with lower support
thresholds (refer figure (b)). As the maximum size of the frequent
itemsets increases, the algorithm will need to make more passes over
the data set.
Number of Items (Dimensionality): As the number of items increases,
more space will be needed to store the support counts of items. If the
number of frequent items also grows with the dimensionality of the data,
the computation and I/O costs will increase because of the larger
number of candidate itemsets generated by the algorithm.
Number of Transactions: Since the Apriori algorithm makes repeated
passes over the data set, its run time increases with a larger number of
transactions.
Frequent Itemset Generation in the Apriori Algorithm
Average Transaction Width: For dense data sets, the average transaction
width can be very large. This affects the complexity of the Apriori
algorithm in two ways.
The maximum size of frequent itemsets tends to increase as the
average transaction width increases. As a result, more candidate
itemsets must be examined during candidate generation and support
counting (refer figure).
As the transaction width increases, more itemsets are contained in
the transaction. This will increase the number of hash tree traversals
performed during support counting.
Rule Generation
Association rules are extracted from a given frequent itemset. Each frequent k-itemset, Y, can produce up
to 2^k − 2 association rules, ignoring rules that have empty antecedents or consequents (∅ → Y or Y → ∅).
An association rule can be extracted by partitioning the itemset Y into two non-empty subsets, X and Y −X,
such that X → Y −X satisfies the confidence threshold.
All such rules must have already met the support threshold because they are generated from a frequent
itemset.
Example: Let Y = {1, 2, 3} be a frequent itemset. There are six candidate association rules that can be
generated from Y: {1, 2} → {3}, {1, 3} → {2}, {2, 3} → {1}, {1} → {2, 3}, {2} → {1, 3}, and {3} → {1, 2}. As the
support of each rule is identical to the support for Y, all of these rules must satisfy the support threshold.
Computing the confidence of an association rule does not require additional scans of the transaction data
set.
Consider the rule {1, 2} → {3}, which is generated from the frequent itemset X = {1, 2, 3}. The confidence for
this rule is σ({1, 2, 3})/σ({1, 2}). Because {1, 2, 3} is frequent, the anti-monotone property of support
ensures that {1, 2} must be frequent, too. Since the support counts for both itemsets were already found
during frequent itemset generation, there is no need to read the entire data set again.
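A small sketch, assuming the support counts below were already recorded during frequent itemset generation, shows how all 2^k − 2 rules of a frequent itemset can be scored without another pass over the transactions.

    from itertools import combinations

    # Hypothetical support counts recorded during frequent itemset generation.
    support = {
        frozenset({1, 2, 3}): 2,
        frozenset({1, 2}): 3, frozenset({1, 3}): 2, frozenset({2, 3}): 3,
        frozenset({1}): 4, frozenset({2}): 5, frozenset({3}): 3,
    }

    def rules_from_itemset(Y, minconf):
        Y = frozenset(Y)
        for size in range(1, len(Y)):                 # non-empty proper subsets only
            for antecedent in combinations(sorted(Y), size):
                X = frozenset(antecedent)
                conf = support[Y] / support[X]        # no extra data scan needed
                if conf >= minconf:
                    yield (set(X), set(Y - X), conf)

    for X, consequent, conf in rules_from_itemset({1, 2, 3}, minconf=0.6):
        print(X, "->", consequent, f"confidence={conf:.2f}")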
Rule Generation
Confidence-Based Pruning
Unlike the support measure, confidence does not have any monotone property.
For example, the confidence for X → Y can be larger, smaller, or equal to the confidence for another
rule X̃ → Ỹ, where X̃ ⊆ X and Ỹ ⊆ Y.
Nevertheless, if we compare rules generated from the same frequent itemset Y, the following theorem
holds for the confidence measure.
Theorem: If a rule X → Y−X does not satisfy the confidence threshold, then any rule X’ → Y−X’, where X’ is a
subset of X, must not satisfy the confidence threshold as well.
Proof: Consider the following two rules: X’ → Y −X’ and X → Y−X, where X’ ⊂ X. The confidence of the rules
are σ(Y )/σ(X’) and σ(Y )/σ(X), respectively. Since X’ is a subset of X, σ(X’) ≥ σ(X). Therefore, the former rule
cannot have a higher confidence than the latter rule.
Rule Generation
Rule Generation in Apriori Algorithm
The Apriori algorithm uses a level-wise approach for generating association rules, where each level corresponds
to the number of items that belong to the rule consequent.
Initially, all the high-confidence rules that have only one item in the rule consequent are extracted. These rules
are then used to generate new candidate rules.
For example, if {acd} → {b} and {abd} → {c} are high-
confidence rules, then the candidate rule {ad} → {bc}
is generated by merging the consequents of both
rules.
The adjacent figure shows a lattice structure for the
association rules generated from the frequent itemset {a,
b, c, d}. If any node in the lattice has low confidence, then
according to theorem (refer previous slide), the entire
subgraph spanned by the node can be pruned
immediately. Suppose the confidence for {bcd} → {a} is
low. All the rules containing item a in its consequent,
including {cd} → {ab}, {bd} → {ac}, {bc} → {ad}, and {d} →
{abc} can be discarded.
Rule Generation
Rule Generation in Apriori Algorithm
Algorithm: Rule generation of the Apriori algorithm.
1: for each frequent k-itemset fk, k ≥ 2 do
2:   H1 = {i | i ∈ fk} {1-item consequents of the rule}
3:   call ap-genrules(fk, H1)
4: end for

Algorithm: Procedure ap-genrules(fk, Hm).
1: k = |fk| {size of frequent itemset}
2: m = |Hm| {size of rule consequent}
3: if k > m + 1 then
4:   Hm+1 = apriori-gen(Hm)
5:   for each hm+1 ∈ Hm+1 do
6:     conf = σ(fk)/σ(fk − hm+1)
7:     if conf ≥ minconf then
8:       output the rule (fk − hm+1) → hm+1
9:     else
10:      delete hm+1 from Hm+1
11:    end if
12:  end for
13:  call ap-genrules(fk, Hm+1)
14: end if
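The pseudocode translates fairly directly into Python. The sketch below is illustrative rather than a faithful reproduction of any particular implementation: the support counts are hypothetical, and apriori-gen on the consequents is approximated by a simple prefix merge. Following the prose description above, only the high-confidence 1-item consequents are kept before the first recursive call.

    from itertools import combinations

    def apriori_gen(h_m, m):
        """Merge m-item consequents that share their first m-1 items."""
        out = set()
        for a, b in combinations(sorted(h_m), 2):
            if a[:m - 1] == b[:m - 1]:
                out.add(tuple(sorted(set(a) | set(b))))
        return out

    def ap_genrules(f_k, h_m, support, minconf, rules):
        if not h_m:
            return
        k, m = len(f_k), len(next(iter(h_m)))
        if k > m + 1:
            h_next = apriori_gen(h_m, m)
            for h in list(h_next):
                antecedent = tuple(sorted(set(f_k) - set(h)))
                conf = support[f_k] / support[antecedent]
                if conf >= minconf:
                    rules.append((antecedent, h, conf))
                else:
                    h_next.discard(h)            # prune: no superset consequent can work
            ap_genrules(f_k, h_next, support, minconf, rules)

    # Hypothetical support counts for the frequent itemset {a, b, c, d} and its subsets.
    support = {
        ("a", "b", "c", "d"): 2,
        ("a", "b", "c"): 2, ("a", "b", "d"): 2, ("a", "c", "d"): 2, ("b", "c", "d"): 3,
        ("a", "b"): 3, ("a", "c"): 3, ("a", "d"): 3, ("b", "c"): 3, ("b", "d"): 4, ("c", "d"): 4,
        ("a",): 4, ("b",): 5, ("c",): 5, ("d",): 5,
    }
    minconf = 0.5
    rules = []
    f_k = ("a", "b", "c", "d")
    # Outer loop, steps 2-3: start from 1-item consequents; keep the high-confidence ones.
    h_1 = set()
    for i in f_k:
        h = (i,)
        antecedent = tuple(sorted(set(f_k) - {i}))
        conf = support[f_k] / support[antecedent]
        if conf >= minconf:
            rules.append((antecedent, h, conf))
            h_1.add(h)
    ap_genrules(f_k, h_1, support, minconf, rules)
    for antecedent, consequent, conf in rules:
        print(antecedent, "->", consequent, round(conf, 2))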
Compact Representation of Frequent Itemsets
In practice, the number of frequent itemsets produced from a transaction data set can be very large. Therefore,
it is useful to identify a small representative set of itemsets from which all other frequent itemsets can be
derived. Two such representations are maximal and closed frequent itemsets.
Maximal Frequent Itemsets: A maximal frequent itemset is a frequent itemset for which none of its immediate
supersets are frequent.
Consider the itemset lattice shown in the adjacent
figure. The itemsets in the lattice are divided into two
groups: (i) those that are frequent and (ii) those that
are infrequent. Every itemset located above the border
is frequent, while those located below the border (the
shaded nodes) are infrequent. The border itemsets {a,
d}, {a, c, e}, and {b, c, d, e} are maximal frequent
itemsets because their immediate supersets are
infrequent.
For example, the itemset {a, d} is maximal frequent
because all of its immediate supersets, {a, b, d}, {a,
c, d}, and {a, d, e}, are infrequent whereas {a, c} is
non-maximal because one of its immediate
supersets, {a, c, e}, is frequent.
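Given the complete collection of frequent itemsets, maximality is easy to check programmatically. The sketch below builds a hypothetical frequent collection as the downward closure of the three maximal itemsets named above and then recovers them by testing immediate supersets.

    from itertools import chain, combinations

    def all_subsets(s):
        s = sorted(s)
        return chain.from_iterable(combinations(s, r) for r in range(1, len(s) + 1))

    # Hypothetical collection of frequent itemsets: the downward closure of the
    # three maximal frequent itemsets named in the example above.
    maximal_seeds = [{"a", "d"}, {"a", "c", "e"}, {"b", "c", "d", "e"}]
    frequent = {frozenset(x) for m in maximal_seeds for x in all_subsets(m)}

    def is_maximal(itemset, frequent, items):
        # An itemset is maximal frequent if none of its immediate supersets is frequent.
        return all(itemset | {i} not in frequent for i in items - itemset)

    items = set().union(*frequent)
    maximal = [set(f) for f in frequent if is_maximal(f, frequent, items)]
    print(maximal)   # {'a', 'd'}, {'a', 'c', 'e'}, {'b', 'c', 'd', 'e'} in some order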
Compact Representation of Frequent Itemsets
Maximal frequent itemsets provide an effective compact representation of frequent itemsets. They form
the smallest set of itemsets from which all frequent itemsets can be derived.
For example, the frequent itemsets shown in figure in previous slide can be divided into two groups:
Frequent itemsets that begin with item a and that may contain items c, d, or e. This group includes
itemsets such as {a}, {a, c}, {a, d}, {a, e}, and {a, c, e}.
Frequent itemsets that begin with items b, c, d, or e. This group includes itemsets such as {b}, {b, c}, {c,
d}, {b, c, d, e}, etc.
Frequent itemsets that belong in the first group are subsets of either {a, c, e} or {a, d}, while those that
belong in the second group are subsets of {b, c, d, e}. Hence, the maximal frequent itemsets {a, c, e}, {a, d},
and {b, c, d, e} provide a compact representation of the frequent itemsets shown in the figure.
Maximal frequent itemsets provide a valuable representation for data sets that can produce very long,
frequent itemsets, as there are exponentially many frequent itemsets in such data.
However, this approach is practical only if an efficient algorithm exists to explicitly find the maximal
frequent itemsets without having to enumerate all their subsets.
Compact Representation of Frequent Itemsets
Despite providing a compact representation, maximal frequent itemsets do not contain the support
information of their subsets.
For example, the supports of the maximal frequent itemsets {a, c, e}, {a, d}, and {b, c, d, e} do not provide
any hint about the support of their subsets. An additional pass over the data set is therefore needed to
determine the support counts of the non-maximal frequent itemsets.
Closed Frequent Itemsets
In some cases, it might be desirable to have a minimal representation of frequent itemsets that preserves
the support information.
Closed itemsets provide a minimal representation of itemsets without losing their support information.
Definition (closed itemset): An itemset X is closed if none of its immediate supersets has exactly the same
support count as X. In other words, X is not closed if at least one of its immediate supersets has the same
support count as X.
Compact Representation of Frequent Itemsets
Examples of closed itemsets are shown in the
adjacent figure.
Each node (itemset) in the lattice is associated with a
list of its corresponding transaction IDs to better
illustrate its support count. For example, since the
node {b, c} is associated with transaction IDs 1, 2,
and 3, its support count is equal to three.
From the transactions given in this diagram, we see
that every transaction that contains b also contains
c. Consequently, the support for {b} is identical to the support for {b, c}, and {b} should not be considered a
closed itemset.
Similarly, since c occurs in every transaction that
contains both a and d, the itemset {a, d} is not
closed.
On the other hand, {b, c} is a closed itemset because
it does not have the same support count as any of its
supersets.
Compact Representation of Frequent Itemsets
Definition (Closed Frequent Itemset). An itemset is a closed frequent itemset if it is closed and its support is
greater than or equal to minsup.
In the previous example (previous slide), assuming that the support threshold is 40%, {b,c} is a closed frequent
itemset because its support is 60%. The rest of the closed frequent itemsets are indicated by the shaded
nodes.
Algorithms are available to explicitly extract closed frequent itemsets from a given data set.
We can use the closed frequent itemsets to determine the support counts for the non-closed frequent
itemsets. For example, consider the frequent itemset {a, d} shown in the figure (previous slide). Because the
itemset is not closed, its support count must be identical to one of its immediate supersets.
The key is to determine which superset (among {a, b, d}, {a, c, d}, or {a, d, e}) has exactly the same support count
as {a, d}. The Apriori principle states that any transaction that contains the superset of {a, d} must also contain {a,
d}. However, any transaction that contains {a, d} does not have to contain the supersets of {a, d}. For this reason,
the support for {a, d} must be equal to the largest support among its supersets. Since {a, c, d} has a larger
support than both {a, b, d} and {a, d, e}, the support for {a, d} must be identical to the support for {a, c, d}.
Using this methodology, an algorithm can be developed to compute the support for the non-closed frequent
itemsets.
Compact Representation of Frequent Itemsets
The algorithm proceeds in a specific-to-general fashion, i.e., from the largest to the smallest frequent
itemsets. This is because, in order to find the support for a non-closed frequent itemset, the support for all
of its supersets must be known.
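A minimal sketch of this specific-to-general computation, using hypothetical closed frequent itemsets and support counts: each non-closed frequent itemset takes the largest support among its already-known immediate supersets (the sketch assumes every non-closed frequent itemset has at least one frequent immediate superset in the list, which holds by definition of closedness).

    # Hypothetical closed frequent itemsets with their support counts.
    closed = {
        frozenset({"a", "c", "d"}): 2,
        frozenset({"b", "c"}): 3,
        frozenset({"c", "d"}): 4,
        frozenset({"c"}): 5,
    }
    # All frequent itemsets whose supports we want (closed and non-closed).
    frequent = [frozenset(x) for x in
                [{"a", "c", "d"}, {"a", "c"}, {"a", "d"}, {"c", "d"}, {"b", "c"},
                 {"a"}, {"b"}, {"c"}, {"d"}]]

    support = dict(closed)
    # Process from the largest itemsets down to the smallest (specific to general),
    # so the supports of all supersets are known before an itemset is visited.
    for itemset in sorted(frequent, key=len, reverse=True):
        if itemset in support:
            continue                      # closed: support already known
        supersets = [s for s in support if itemset < s and len(s) == len(itemset) + 1]
        support[itemset] = max(support[s] for s in supersets)

    for itemset in sorted(frequent, key=len, reverse=True):
        print(set(itemset), support[itemset])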
Closed frequent itemsets are useful for removing some of the redundant association rules.
An association rule X → Y is redundant if there exists another rule X’ → Y’, where X is a subset of X’ and
Y is a subset of Y’, such that the support and confidence for both rules are identical.
In the example shown in slide 38, {b} is not a closed frequent itemset while {b, c} is closed. The association
rule {b} → {d, e} is therefore redundant because it has the same support and confidence as {b, c} → {d, e}.
Such redundant rules are not generated if closed frequent itemsets are used for rule generation.
Mining Frequent Itemsets Using the Vertical Data Format
There are many ways to represent a transaction data set. However, the choice of representation can affect
the I/O costs incurred when computing the support of candidate itemsets.
The adjacent figure shows two different ways of representing market basket transactions. The first
representation is called a horizontal data layout (TID-itemset format (i.e., {TID : itemset}), where TID is a
transaction ID and itemset is the set of items bought in transaction TID). It is adopted by many association
rule mining algorithms, including Apriori.
Another possibility is to store the list of transaction identifiers
associated with each item (item-TID set format (i.e., {item : TID
set}), where item is an item name, and TID set is the set of
transaction identifiers containing the item). Such a
representation is known as the vertical data layout.
The support for each candidate itemset is obtained by
intersecting the TID-sets of its subset items. The length of the
TID-sets shrinks as we progress to larger sized itemsets.
However, one problem with this approach is that the initial set of
TID-sets may be too large to fit into main memory, thus requiring
more sophisticated techniques to compress the TID-sets.
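A short sketch of support counting in the vertical layout, reusing the assumed five-transaction example from earlier: the TID-set of an itemset is the intersection of its items' TID-sets, and the support count is the size of that intersection.

    transactions = {
        1: {"Bread", "Milk"},
        2: {"Bread", "Diapers", "Beer", "Eggs"},
        3: {"Milk", "Diapers", "Beer", "Cola"},
        4: {"Bread", "Milk", "Diapers", "Beer"},
        5: {"Bread", "Milk", "Diapers", "Cola"},
    }

    # Convert the horizontal layout ({TID: itemset}) to the vertical layout ({item: TID set}).
    tidsets = {}
    for tid, items in transactions.items():
        for item in items:
            tidsets.setdefault(item, set()).add(tid)

    def support_count(itemset):
        """Intersect the TID-sets of the items; the support count is the result's size."""
        tids = set.intersection(*(tidsets[i] for i in itemset))
        return len(tids), tids

    print(tidsets["Beer"])                             # {2, 3, 4}
    print(support_count({"Milk", "Diapers", "Beer"}))  # (2, {3, 4})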
Mining Frequent Itemsets Using FP-Growth Algorithm
FP-Tree Representation
An FP-tree is a compressed representation of the input data.
It is constructed by reading the data set one transaction at a time and mapping each transaction onto a
path in the FP-tree.
As different transactions can have several items in common, their paths may overlap.
The more the paths overlap with one another, the more compression we can achieve using the FP-tree
structure.
If the size of the FP-tree is small enough to fit into main memory, we can extract frequent itemsets directly
from the structure in memory instead of making repeated passes over the data stored on disk.
Each node in the tree contains the label of an item along with a counter that shows the number of
transactions mapped onto the given path.
Mining Frequent Itemsets Using FP-Growth Algorithm
Initially, the FP-tree contains only the root node represented by the null symbol. The FP-tree is subsequently
extended in the following way:
1. The data set is scanned once to determine the support count of each item. Infrequent items are discarded,
while the frequent items are sorted in decreasing support counts. In our example, a is the most frequent
item, followed by b, c, d, and e.
2. The algorithm makes a second pass over the data to construct the FP-tree. After reading the first
transaction, {a, b}, the nodes labeled as a and b are created. A path is then formed from null → a → b to
encode the transaction. Every node along the path has a frequency count of 1.
3. After reading the second transaction, {b, c, d}, a new set of nodes is created for items b, c, and d. A path is
then formed to represent the transaction by connecting the nodes null → b → c → d. Every node along this
path also has a frequency count equal to one. Although the first two transactions have an item in common,
which is b, their paths are disjoint because the transactions do not share a common prefix.
4. The third transaction, {a, c, d, e}, shares a common prefix item (which is a) with the first transaction. As a
result, the path for the third transaction, null → a → c → d → e, overlaps with the path for the first
transaction, null → a → b. Because of their overlapping path, the frequency count for node a is incremented
to two, while the frequency counts for the newly created nodes, c, d, and e, are equal to one.
5. This process continues until every transaction has been mapped onto one of the paths given in the FP-tree.
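A compact construction sketch following the two passes described above. The transaction list is an assumed completion of the partial transactions quoted in the steps (chosen so that a is the most frequent item, followed by b, c, d, and e); node links and the header table used by FP-growth are omitted to keep the sketch short.

    from collections import Counter

    class FPNode:
        def __init__(self, item, parent=None):
            self.item, self.count, self.parent, self.children = item, 0, parent, {}

    def build_fp_tree(transactions, minsup_count):
        # Pass 1: count item supports, drop infrequent items,
        # and fix a global order by decreasing support count.
        counts = Counter(item for t in transactions for item in t)
        frequent = {i: c for i, c in counts.items() if c >= minsup_count}
        order = {i: rank for rank, i in
                 enumerate(sorted(frequent, key=frequent.get, reverse=True))}

        # Pass 2: insert each transaction as a path, reusing common prefixes.
        root = FPNode(None)
        for t in transactions:
            node = root
            for item in sorted((i for i in t if i in order), key=order.get):
                child = node.children.setdefault(item, FPNode(item, parent=node))
                child.count += 1
                node = child
        return root

    def show(node, depth=0):
        for child in node.children.values():
            print("  " * depth + f"{child.item}:{child.count}")
            show(child, depth + 1)

    # Assumed transactions matching the walk-through above ({a, b}, {b, c, d}, {a, c, d, e}, ...).
    transactions = [
        {"a", "b"}, {"b", "c", "d"}, {"a", "c", "d", "e"}, {"a", "d", "e"},
        {"a", "b", "c"}, {"a", "b", "c", "d"}, {"a"}, {"a", "b", "c"},
        {"a", "b", "d"}, {"b", "c", "e"},
    ]
    show(build_fp_tree(transactions, minsup_count=2))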
Mining Frequent Itemsets Using FP-Growth Algorithm
The size of an FP-tree is typically smaller than the size of the uncompressed data because many
transactions in market basket data often share a few items in common.
In the best-case scenario, where all the transactions have the same set of items, the FP-tree contains only a
single branch of nodes.
The worst-case scenario happens when every transaction has a unique set of items. As none of the
transactions have any items in common, the size of the FP-tree is effectively the same as the size of the
original data. However, the physical storage requirement for the FP-tree is higher because it requires
additional space to store pointers between nodes and counters for each item.
FP-growth finds all the frequent itemsets ending with a particular suffix by employing a divide-and-conquer
strategy to split the problem into smaller subproblems.
For example, suppose we are interested in finding all frequent itemsets ending in e. To do this
We must first check whether the itemset {e} itself is frequent.
If it is frequent, we consider the subproblem of finding frequent itemsets ending in de, followed by ce,
be, and ae.
In turn, each of these subproblems are further decomposed into smaller subproblems.
By merging the solutions obtained from the subproblems, all the frequent itemsets ending in e can be
found.
This divide-and-conquer approach is the key strategy employed by the FP-growth algorithm.
Mining Frequent Itemsets Using FP-Growth Algorithm
On how to solve the subproblems, consider the task of finding frequent itemsets ending with e.
1. The first step is to gather all the paths containing node e. These initial paths are called prefix paths and
are shown in figure (a).
2. From the prefix paths shown in figure (a), the support count for e is obtained by adding the support
counts associated with node e. Assuming that the minimum support count is 2, {e} is declared a
frequent itemset because its support count is 3.
3. Because {e} is frequent, the algorithm has to solve the subproblems of finding frequent itemsets
ending in de, ce, be, and ae. Before solving these subproblems, it must first convert the prefix paths
into a conditional FP-tree, which is structurally similar to an FP-tree, except it is used to find frequent
itemsets ending with a particular suffix. A conditional FP-tree is obtained in the following way:
a) First, the support counts along the prefix paths must be updated because some of the counts
include transactions that do not contain item e. For example, the rightmost path shown in figure
(a), null → b:2 → c:2 → e:1, includes a transaction {b, c} that does not contain item e. The counts
along the prefix path must therefore be adjusted to 1 to reflect the actual number of transactions
containing {b, c, e}.
b) The prefix paths are truncated by removing the nodes for e. These nodes can be removed because
the support counts along the prefix paths have been updated to reflect only transactions that
contain e and the subproblems of finding frequent itemsets ending in de, ce, be, and ae no longer
need information about node e.
Mining Frequent Itemsets Using FP-Growth Algorithm
c) After updating the support counts along the prefix paths, some of the items may no longer be
frequent. For example, the node b appears only once and has a support count equal to 1, which
means that there is only one transaction that contains both b and e. Item b can be safely ignored
from subsequent analysis because all itemsets ending in be must be infrequent.
The conditional FP-tree for e is shown in Figure (b). The tree looks different than the original prefix
paths because the frequency counts have been updated and the nodes b and e have been eliminated.
4. FP-growth uses the conditional FP-tree for e to solve the subproblems of finding frequent itemsets
ending in de, ce, and ae. To find the frequent itemsets ending in de, the prefix paths for d are gathered
from the conditional FP-tree for e (figure (c)). By adding the frequency counts associated with node d,
we obtain the support count for {d, e}. Since the support count is equal to 2, {d, e} is declared a
frequent itemset. Next, the algorithm constructs the conditional FP-tree for de using the approach
described in step 3. After updating the support counts and removing the infrequent item c, the
conditional FP-tree for de is shown in figure (d). Since the conditional FP-tree contains only one item, a,
whose support is equal to minsup, the algorithm extracts the frequent itemset {a, d, e} and moves on
to the next subproblem, which is to generate frequent itemsets ending in ce. After processing the
prefix paths for c, only {c, e} is found to be frequent. The algorithm proceeds to solve the next
subproblem and finds {a, e} to be the only frequent itemset remaining.
Mining Frequent Itemsets Using FP-Growth Algorithm
At each recursive step, a conditional FP-tree is constructed by updating the frequency counts along the
prefix paths and removing all infrequent items. Because the subproblems are disjoint, FP-growth will not
generate any duplicate itemsets. In addition, the counts associated with the nodes allow the algorithm to
perform support counting while generating the common suffix itemsets.
FP-growth illustrates how a compact representation of the transaction data set helps to efficiently generate
frequent itemsets. In addition, for certain transaction data sets, FP-growth outperforms the standard
Apriori algorithm by several orders of magnitude. The run-time performance of FP-growth depends on the
compaction factor of the data set. If the resulting conditional FP-trees are very bushy (in the worst case, a
full prefix tree), then the performance of the algorithm degrades significantly because it has to generate a
large number of subproblems and merge the results returned by each subproblem.
Evaluation of Association Patterns
Association analysis algorithms have the potential to generate a large number of patterns. As the size and
dimensionality of real commercial databases can be very large, we can easily end up with thousands or
even millions of patterns, many of which might not be interesting.
However, identifying the most interesting patterns is not a trivial task because “one person’s trash might be
another person’s treasure”. Therefore, it is important to establish a set of well-accepted criteria for
evaluating the quality of association patterns.
The first set of criteria can be established through statistical arguments. Patterns that involve a set of
mutually independent items or cover very few transactions are considered uninteresting because they may
capture spurious relationships in the data. Such patterns can be eliminated by applying an objective
interestingness measure that uses statistics derived from data to determine whether a pattern is
interesting. Examples of objective interestingness measures include support, confidence, and correlation.
The second set of criteria can be established through subjective arguments. A pattern is considered
subjectively uninteresting unless it reveals unexpected information about the data or provides useful
knowledge that can lead to profitable actions. For example, the rule {Butter} → {Bread} may not be
interesting, despite having high support and confidence values, because the relationship represented by
the rule may seem rather obvious. On the other hand, the rule {Diapers} → {Beer} is interesting because the
relationship is quite unexpected and may suggest a new cross-selling opportunity for retailers.
Evaluation of Association Patterns
However, incorporating subjective knowledge into pattern evaluation is a difficult task because it requires a
considerable amount of prior information from the domain experts.
Some of the approaches for incorporating subjective knowledge into the pattern discovery task
Visualization: This approach requires a user-friendly environment to keep the human user in the loop. It
also allows the domain experts to interact with the data mining system by interpreting and verifying the
discovered patterns.
Template-based approach: This approach allows the users to constrain the type of patterns extracted by
the mining algorithm. Instead of reporting all the extracted rules, only rules that satisfy a user-specified
template are returned to the users.
Subjective interestingness measure: A subjective measure can be defined based on domain information
such as concept hierarchy or profit margin of items. Such measures can then be used to filter patterns that
are obvious and non-actionable.
Evaluation of Association Patterns
Objective Measures of Interestingness
An objective measure is a data-driven approach for evaluating the quality of association patterns. Other
than specifying a threshold for filtering low-quality patterns, it is domain-independent and requires
minimal input from the users. An objective measure is usually computed based on the frequency counts
tabulated in a contingency table.
The adjacent table shows an example of a contingency table for a pair of binary variables, A and B.
The symbols Ā and B̄ are used to indicate that A and B, respectively, are absent from a transaction.
Each entry fij in this table denotes a frequency count. For example, f11 is the number of times A and B appear
together in the same transaction, while f01 is the number of transactions that contain B but not A.
The row sum f1+ represents the support count for A, while the column sum f+1 represents the support count
for B.

          B      B̄
   A     f11    f10    f1+
   Ā     f01    f00    f0+
         f+1    f+0     N
The contingency tables are also applicable to other attribute types such as
symmetric binary, nominal, and ordinal variables.
Evaluation of Association Patterns
Limitations of the Support-Confidence Framework
Existing association rule mining formulation relies on the support and confidence measures to eliminate
uninteresting patterns. We have already discussed the drawback of support. The drawback of confidence is
more subtle.
Example: Suppose we are interested in analyzing the relationship between people who drink tea and
coffee. The gathered information is shown in the adjacent contingency table. Let us evaluate the
association rule {Tea} →{Coffee}.
At first glance, it may appear that people who drink tea also tend to drink coffee because the rule’s support
(15%) and confidence (75%) values are reasonably high.
This argument would have been acceptable except that the fraction of people who drink coffee, regardless
of whether they drink tea, is 80%, while the fraction of tea drinkers who drink coffee is only 75%.
Thus knowing that a person is a tea drinker actually decreases her probability of being a coffee drinker from
80% to 75%! The rule {Tea} → {Coffee} is therefore misleading despite its high confidence value.

              Coffee   No Coffee   Total
   Tea          150        50        200
   No Tea       650       150        800
   Total        800       200       1000
Evaluation of Association Patterns
Limitations of the Support-Confidence Framework
The pitfall of confidence is that it ignores the support of the itemset in the rule consequent.
In the previous example, if the support of coffee drinkers is taken into account, we would not be surprised
to find that many of the people who drink tea also drink coffee. More surprisingly, the fraction of tea
drinkers who drink coffee is actually less than the overall fraction of people who drink coffee, which points
to an inverse relationship between tea drinkers and coffee drinkers.
Because of the limitations in the support-confidence framework, various objective measures have been
used to evaluate the quality of association patterns.
To tackle this weakness, a correlation measure can be used to augment the support–confidence framework
for association rules. This leads to correlation rules of the form
A → B [support, confidence, correlation]
That is, a correlation rule is measured not only by its support and confidence but also by the correlation
between itemsets A and B.
Evaluation of Association Patterns
Limitations of the Support-Confidence Framework
Lift is a simple correlation measure defined as follows. The occurrence of itemset A is independent of the
occurrence of itemset B if P(A ∪ B) = P(A)P(B); otherwise, itemsets A and B are dependent and correlated
as events. This definition can easily be extended to more than two itemsets. The lift between the
occurrence of A and B can be measured by computing
Lift(A, B) = P(A ∪ B) / (P(A) P(B))
If the resulting value of the equation is < 1, then the occurrence of A is negatively correlated with the
occurrence of B, meaning that the occurrence of one likely leads to the absence of the other one.
If the resulting value is > 1, then A and B are positively correlated, meaning that the occurrence of one
implies the occurrence of the other.
If the resulting value is equal to 1, then A and B are independent and there is no correlation between them.
The above equation is equivalent to P(B|A)/P(B), or conf(A → B)/sup(B), which is also referred to as the lift
of the association (or correlation) rule A → B.
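Using the tea/coffee contingency table from the earlier example, the lift of {Tea} → {Coffee} can be computed directly; a small sketch:

    # Contingency table counts from the tea/coffee example.
    f11, f10 = 150, 50     # tea & coffee, tea & no coffee
    f01, f00 = 650, 150    # no tea & coffee, no tea & no coffee
    N = f11 + f10 + f01 + f00            # 1000

    p_tea_and_coffee = f11 / N           # 0.15  (rule support)
    p_tea = (f11 + f10) / N              # 0.20
    p_coffee = (f11 + f01) / N           # 0.80
    confidence = p_tea_and_coffee / p_tea            # 0.75
    lift = p_tea_and_coffee / (p_tea * p_coffee)     # 0.9375, equivalently confidence / p_coffee
    print(confidence, lift)              # lift < 1: Tea and Coffee are negatively correlated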
Interest factor compares the frequency of a pattern against a baseline frequency computed under the
statistical independence assumption. The baseline frequency for a pair of mutually independent variables is
f11/N = (f1+/N) × (f+1/N), or equivalently, f11 = (f1+ × f+1)/N
This equation follows from the standard approach of using simple fractions as estimates for probabilities.
The fraction f11/N is an estimate for the joint probability P(A,B), while f1+/N and f+1/N are the estimates for
P(A) and P(B), respectively. If A and B are statistically independent, then P(A,B) = P(A) × P(B), thus leading to
the formula shown above. The interest factor is therefore defined as the ratio of the observed frequency to
the baseline frequency,
I(A, B) = (N × f11) / (f1+ × f+1) = s(A, B) / (s(A) × s(B)),
and can be interpreted as follows:
I(A, B) = 1, if A and B are independent;
I(A, B) > 1, if A and B are positively correlated;
I(A, B) < 1, if A and B are negatively correlated.
Evaluation of Association Patterns
Limitations of the Support-Confidence Framework
Correlation Analysis: Correlation analysis is a statistical-based technique for analyzing relationships
between a pair of variables. For continuous variables, correlation is defined using Pearson’s correlation
coefficient. For binary variables, correlation can be measured using the φ-coefficient, which is defined as
φ = (f11 f00 − f01 f10) / √(f1+ f+1 f0+ f+0)
The value of correlation ranges from −1 (perfect negative correlation) to +1 (perfect positive correlation). If
the variables are statistically independent, then φ = 0.
IS Measure: IS is an alternative measure that has been proposed for handling asymmetric binary variables.
The measure is defined as follows:
IS(A, B) = √(I(A, B) × s(A, B)) = s(A, B) / √(s(A) × s(B))
IS is large when the interest factor and support of the pattern are large.
Evaluation of Association Patterns
Limitations of the Support-Confidence Framework
It is possible to show that IS is mathematically equivalent to the cosine measure for binary variables. In this
regard, we consider A and B as a pair of bit vectors, A • B = s(A, B) the dot product between the vectors, and
|A| = √s(A) the magnitude of vector A. Therefore:
IS(A, B) = s(A, B) / √(s(A) × s(B)) = (A • B) / (|A| × |B|) = cosine(A, B)
The IS measure can also be expressed as the geometric mean between the confidence of association rules
extracted from a pair of binary variables:
IS(A, B) = √( (s(A, B)/s(A)) × (s(A, B)/s(B)) ) = √( c(A → B) × c(B → A) )
Because the geometric mean between any two numbers is always closer to the smaller number, the IS
value of an itemset {p, q} is low whenever one of its rules, p → q or q → p, has low confidence.
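The φ-coefficient and the IS (cosine) measure can likewise be computed from contingency counts; a sketch reusing the tea/coffee numbers:

    from math import sqrt

    f11, f10, f01, f00 = 150, 50, 650, 150
    f1p, f0p = f11 + f10, f01 + f00       # row sums
    fp1, fp0 = f11 + f01, f10 + f00       # column sums
    N = f1p + f0p

    # phi-coefficient: (f11*f00 - f01*f10) / sqrt(f1+ * f+1 * f0+ * f+0)
    phi = (f11 * f00 - f01 * f10) / sqrt(f1p * fp1 * f0p * fp0)

    # IS measure: s(A,B) / sqrt(s(A) * s(B)), using support fractions.
    s_ab, s_a, s_b = f11 / N, f1p / N, fp1 / N
    IS = s_ab / sqrt(s_a * s_b)
    # Equivalently, the geometric mean of the two rule confidences.
    IS_check = sqrt((s_ab / s_a) * (s_ab / s_b))

    print(round(phi, 4), round(IS, 3), round(IS_check, 3))   # -0.0625 0.375 0.375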
Effect of skewed Support Distribution
The performances of many association analysis algorithms are influenced by properties of their input data.
For example, the computational complexity of the Apriori algorithm depends on properties such as the
number of items in the data and average transaction width.
The skewed support distribution in the data set (where most of the items have relatively low to moderate
frequencies, but a small number of them have very high frequencies) also has significant influence on the
performance of association analysis algorithms as well as the quality of extracted patterns.
An example of such a real data set is shown in the adjacent
figure. This data, taken from the PUMS (Public Use Microdata
Sample) census data, contains 49,046 records and 2113
asymmetric binary variables.
We shall treat the asymmetric binary variables as items and
records as transactions. While more than 80% of the items have
support less than 1%, a handful of them have support greater
than 90%.
Consider the grouping of the items in the census data set based on their support values, as shown in the
table below.

   Group               G1          G2         G3
   Support            < 1%      1% - 90%    > 90%
   Number of Items     1735        358         20
Effect of skewed Support Distribution
Choosing the right support threshold for mining this data set (i.e., skewed data sets) can be quite tricky. If we set the
threshold too high (e.g., 20%), then we may miss many interesting patterns involving the low support items from G1.
In market basket analysis, such low support items may correspond to expensive products (such as jewelry) that
are seldom bought by customers, but whose patterns are still interesting to retailers.
Conversely, when the threshold is set too low, it becomes difficult to find the association patterns due to the following
reasons.
First, the computational and memory requirements of existing association analysis algorithms increase
considerably with low support thresholds.
Second, the number of extracted patterns also increases substantially with low support thresholds.
Third, we may extract many spurious patterns that relate a high-frequency item such as milk to a low-frequency
item such as caviar. Such patterns, which are called cross-support patterns, are likely to be spurious because their
correlations tend to be weak.
For example, at a support threshold equal to 0.05%, there are 18,847 frequent pairs involving items from G1
and G3. Out of these, 93% of them are cross-support patterns; i.e., the patterns contain items from both G1
and G3. The maximum correlation obtained from the cross-support patterns is 0.029, which is much lower
than the maximum correlation obtained from frequent patterns involving items from the same group (which
is as high as 1.0).
This example shows that a large number of weakly correlated cross-support patterns can be generated when the
support threshold is sufficiently low.
Effect of skewed Support Distribution
Definition (Cross-Support Pattern): A cross-support pattern is an itemset X = {i1, i2, . . . , ik} whose support
ratio
r(X) = min[s(i1), s(i2), . . . , s(ik)] / max[s(i1), s(i2), . . . , s(ik)]
is less than a user-specified threshold hc.
Example: Suppose the support for milk is 70%, while the support for sugar is 10% and the support for caviar
is 0.04%. Given hc = 0.01, the frequent itemset {milk, sugar, caviar} is a cross-support pattern because its
support ratio is r = min[0.7, 0.1, 0.0004] / max[0.7, 0.1, 0.0004] = 0.0004 / 0.7 ≈ 0.00058, which is less than hc.
Returning to the example (which refers to a figure, not reproduced here, involving three items p, q, and r),
we notice that the rule {p} → {q} has very low confidence because most of the transactions that contain p do
not contain q. In contrast, the rule {r} → {q}, which is derived from the pattern {q, r}, has very high
confidence. The rule {q} → {r} also has high confidence even though {q, r} is not a cross-support pattern.
This example demonstrates the difficulty of using the confidence measure to distinguish between rules
extracted from cross-support patterns and those extracted from non-cross-support patterns.
The observation nevertheless suggests that cross-support patterns can be detected by examining the lowest
confidence rule that can be extracted from a given itemset. This lowest confidence defines the h-confidence
(or all-confidence) measure of an itemset X = {i1, i2, . . . , ik}:
h-confidence(X) = s(X) / max[s(i1), s(i2), . . . , s(ik)],
whose upper bound is the support ratio r(X) defined above.
Note the equivalence between the upper bound of h-confidence and the support ratio (r) in the definition (Cross-
Support Pattern). Because the support ratio for a cross-support pattern is always less than hc, the h-confidence of
the pattern is also guaranteed to be less than hc.
Therefore, cross-support patterns can be eliminated by ensuring that the h-confidence values for the patterns
exceed hc.
The h-confidence measure is also anti-monotone, i.e., h-confidence({i1, i2, . . . , ik}) ≥ h-confidence({i1, i2, . . . ,
ik+1}), and thus can be incorporated directly into the mining algorithm.
The h-confidence also ensures that the items contained in an itemset are strongly associated with each other.
For example, suppose the h-confidence of an itemset X is 80%. If one of the items in X is present in a
transaction, there is at least an 80% chance that the rest of the items in X also belong to the same
transaction. Such strongly associated patterns are called hyperclique patterns.
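A final sketch pulls these ideas together: the support ratio r(X) and the h-confidence of an itemset computed from individual item supports, with the resulting cross-support and hyperclique checks. The itemset support used below is hypothetical.

    def support_ratio(item_supports):
        """r(X) = min item support / max item support."""
        return min(item_supports.values()) / max(item_supports.values())

    def h_confidence(itemset_support, item_supports):
        """h-confidence(X) = s(X) / max item support (the lowest-confidence rule from X)."""
        return itemset_support / max(item_supports.values())

    # Supports for the {milk, sugar, caviar} example; the itemset support is hypothetical.
    item_supports = {"milk": 0.70, "sugar": 0.10, "caviar": 0.0004}
    itemset_support = 0.0003     # must be <= the minimum item support

    hc = 0.01
    r = support_ratio(item_supports)
    h = h_confidence(itemset_support, item_supports)
    print(round(r, 5), round(h, 5))
    print("cross-support pattern:", r < hc)     # True: support ratio below the threshold
    print("hyperclique pattern:", h >= hc)      # False: h-confidence is bounded above by r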