Data Mining

Association Analysis: Basic Concepts and Algorithms

Pramod Kumar Singh


Professor (Computer Science and Engineering)
ABV – Indian Institute of Information Technology Management Gwalior
Gwalior – 474015, MP, India
Introduction
Many business enterprises accumulate large quantities of data from their day-to-day operations. For
example, huge amounts of customer purchase data are collected daily at the checkout counters of grocery
stores. Such a dataset is commonly known as market basket transactions (refer to the table below). Each row in this table corresponds to a transaction, which contains a unique identifier labeled TID and a set of items bought by a given customer.
Retailers are interested in analyzing the data to learn about the purchasing behavior of their customers.
Such valuable information can be used to support a variety of business-related applications such as
marketing promotions, inventory management, and customer relationship management. A methodology
known as association analysis is useful for discovering interesting relationships hidden in large data sets.
The uncovered relationships can be represented in the form of association rules or sets of frequent items. For example, the following rule can be extracted from the data set shown in the table below:

{Diapers} → {Beer}

The rule suggests that a strong relationship exists between the sale of diapers and beer because many customers who buy diapers also buy beer. Retailers can use this type of rule to help them identify new opportunities for cross-selling their products to the customers.

An example of market basket transactions:
TID  Items
1    {Bread, Milk}
2    {Bread, Diapers, Beer, Eggs}
3    {Milk, Diapers, Beer, Cola}
4    {Bread, Milk, Diapers, Beer}
5    {Bread, Milk, Diapers, Cola}
Problem Definition
Besides market basket data, association analysis is also applicable to other application domains, e.g.,
bioinformatics, medical diagnosis, Web mining, scientific data analysis. There are two key issues that need to be
addressed when applying association analysis to market basket data.
 First, discovering patterns from a large transaction data set can be computationally expensive.
 Second, some of the discovered patterns are potentially spurious because they may happen simply by
chance.
Binary Representation: Market basket data can be represented in a binary format as shown in the Table below,
where each row corresponds to a transaction and each column corresponds to an item. An item can be treated
as a binary variable whose value is one if the item is present in a transaction and zero otherwise. Because the
presence of an item in a transaction is often considered more important than its absence, an item is an
asymmetric binary variable.
Itemset and Support Count: Let I = {i1, i2, . . . , id} be the set of all items in a market basket data set and T = {t1, t2, . . . , tN} be the set of all transactions. Each transaction ti contains a subset of items chosen from I. In association analysis, a collection of zero or more items is termed an itemset. If an itemset contains k items, it is called a k-itemset. For instance, {Beer, Diapers, Milk} is an example of a 3-itemset. The null (or empty) set is an itemset that does not contain any items.

Binary 0/1 representation of market basket data:
TID  Bread  Milk  Diapers  Beer  Eggs  Cola
1    1      1     0        0     0     0
2    1      0     1        1     1     0
3    0      1     1        1     0     1
4    1      1     1        1     0     0
5    1      1     1        0     0     1
Problem Definition
The transaction width is defined as the number of items present in a transaction. A transaction tj is said to
contain an itemset X if X is a subset of tj. For example, the second transaction shown in the Table (previous
slide) contains the itemset {Bread, Diapers} but not {Bread, Milk}. An important property of an itemset is
its support count, which refers to the number of transactions that contain a particular itemset.
Mathematically, the support count, σ(X), for an itemset X can be stated as follows:
σ(X) = |{ti|X ⊆ ti, ti ∈ T}|
where the symbol | · | denotes the number of elements in a set. In the data set shown in the Table (previous
slide), the support count for {Beer, Diapers, Milk} is equal to two because there are only two transactions
that contain all three items.
Association Rule: An association rule is an implication expression of the form X → Y , where X and Y are
disjoint itemsets, i.e., X ∩ Y = ∅. The strength of an association rule can be measured in terms of its
support and confidence. Support determines how often a rule is applicable to a given data set, while
confidence determines how frequently items in Y appear in transactions that contain X. The formal
definitions of these metrics are
Support, s(X → Y) = σ(X ∪ Y) / N

Confidence, c(X → Y) = σ(X ∪ Y) / σ(X)
Problem Definition

Example:
Consider the rule {Milk, Diapers} → {Beer}.
Since the support count for {Milk, Diapers, Beer} is 2 and the total number of transactions is 5, the rule’s
support is 2/5 = 0.4.
The rule’s confidence is obtained by dividing the support count for {Milk, Diapers, Beer} by the support
count for {Milk, Diapers}. Since there are 3 transactions that contain milk and diapers, the confidence for
this rule is 2/3 = 0.67.
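To make these calculations concrete, the following is a minimal Python sketch (the variable names and helper functions are illustrative, not part of the original slides) that reproduces the support and confidence of {Milk, Diapers} → {Beer} on the five example transactions.

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]

def support_count(itemset, transactions):
    # sigma(X): number of transactions that contain the itemset X
    return sum(1 for t in transactions if itemset <= t)

def rule_metrics(antecedent, consequent, transactions):
    # returns (support, confidence) of the rule antecedent -> consequent
    n = len(transactions)
    sigma_xy = support_count(antecedent | consequent, transactions)
    return sigma_xy / n, sigma_xy / support_count(antecedent, transactions)

s, c = rule_metrics({"Milk", "Diapers"}, {"Beer"}, transactions)
print(s, c)   # 0.4 and 0.666..., matching the example above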
Problem Definition
Why Use Support and Confidence?
Support is an important measure because a rule that has very low support may occur simply by chance. A
low support rule is also likely to be uninteresting from a business perspective because it may not be
profitable to promote items that customers seldom buy together. For these reasons, support is often used
to eliminate uninteresting rules. It also has a desirable property that can be exploited for the efficient
discovery of association rules.
Confidence, on the other hand, measures the reliability of the inference made by a rule. For a given rule X
→ Y , the higher the confidence, the more likely it is for Y to be present in transactions that contain X.
Confidence also provides an estimate of the conditional probability of Y given X.
Association analysis results should be interpreted with caution. The inference made by an association rule
does not necessarily imply causality. Instead, it suggests a strong co-occurrence relationship between
items in the antecedent and consequent of the rule. Causality, on the other hand, requires knowledge
about the causal and effect attributes in the data and typically involves relationships occurring over time
(e.g., ozone depletion leads to global warming).
Problem Definition

Formulation of Association Rule Mining Problem


Association Rule Discovery: Given a set of transactions T, find all the rules having support ≥ minsup and
confidence ≥ minconf, where minsup and minconf are the corresponding support and confidence
thresholds.
A brute-force approach for mining association rules is to compute the support and confidence for every
possible rule. This approach is prohibitively expensive because exponentially many rules can be extracted
from a data set. The total number of possible rules that can be extracted from a data set containing d items is R = 3^d − 2^(d+1) + 1. Even for the small data set shown in the Table (slide 3), this approach requires computing the support and confidence for 3^6 − 2^7 + 1 = 602 rules.
More than 80% of the rules are discarded after applying minsup = 20% and minconf = 50%, which means that most of the computation is wasted. To avoid performing needless computations, it would be useful
to prune the rules early without having to compute their support and confidence values.
Problem Definition
Formulation of Association Rule Mining Problem
An initial step toward improving the performance of association rule mining algorithms is to decouple the
support and confidence requirements. The support of a rule X → Y depends only on the support of its
corresponding itemset, X ∪ Y. For example, the following rules have identical support because they involve
items from the same itemset, {Beer, Diapers, Milk}:
{Beer, Diapers} → {Milk}, {Beer, Milk} → {Diapers},
{Diapers, Milk} → {Beer}, {Beer} → {Diapers, Milk},
{Milk} → {Beer,Diapers}, {Diapers} → {Beer,Milk}.
If the itemset is infrequent, then all six candidate rules can be pruned immediately without our having to
compute their confidence values.
Therefore, a common strategy adopted by many association rule mining algorithms is to decompose the
problem into two major subtasks:
 Frequent Itemset Generation, whose objective is to find all the itemsets that satisfy the minsup threshold.
These itemsets are called frequent itemsets.
 Rule Generation, whose objective is to extract all the high-confidence rules from the frequent itemsets
found in the previous step. These rules are called strong rules.
The computational requirements for frequent itemset generation are generally more expensive than those of
rule generation.
Frequent Itemset Generation

A lattice structure can be used to enumerate the list of all possible itemsets. In general, a data set containing k items can potentially generate up to 2^k − 1 frequent itemsets, excluding the null set. Because k can be very large in many practical applications, the search space of itemsets that need to be explored is exponentially large.

(Figure: An itemset lattice for I = {a, b, c, d, e})
Frequent Itemset Generation
A brute-force approach for finding frequent
itemsets is to determine the support count for
every candidate itemset in the lattice structure. To
do this, we need to compare each candidate
against every transaction as shown in the adjacent
figure.
If the candidate is contained in a transaction, its
support count will be incremented. For example,
the support for {Bread, Milk} is incremented three
times because the itemset is contained in three
transactions 1, 4, and 5.
Such an approach can be very expensive because it requires O(NMw) comparisons, where N is the number of transactions, M = 2^k − 1 is the number of candidate itemsets, and w is the maximum transaction width.

(Figure: Counting the support of candidate itemsets)
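As an illustration of why this is expensive, the following Python sketch (illustrative only) implements the brute-force strategy literally: it enumerates every itemset in the lattice and scans all transactions for each one, which is exactly the O(NMw) behaviour described above.

from itertools import combinations

def brute_force_frequent(transactions, minsup_count):
    items = sorted(set().union(*transactions))
    frequent = {}
    # M = 2^k - 1 candidate itemsets (every non-empty subset of the items)
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            cand = set(candidate)
            # compare the candidate against every one of the N transactions
            count = sum(1 for t in transactions if cand <= t)
            if count >= minsup_count:
                frequent[frozenset(cand)] = count
    return frequent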
Frequent Itemset Generation

There are several ways to reduce the computational complexity of frequent itemset generation.
 Reduce the number of candidate itemsets (M). The Apriori principle is an effective way to eliminate
some of the candidate itemsets without counting their support values.
 Reduce the number of comparisons. Instead of matching each candidate itemset against every
transaction, we can reduce the number of comparisons by using more advanced data structures, either
to store the candidate itemsets or to compress the data set.
Frequent Itemset Generation

The Apriori Principle:


If an itemset is frequent, then all of its subsets
must also be frequent.

Consider the itemset lattice in the adjacent figure. Suppose {c, d, e} is a frequent itemset.
Clearly, any transaction that contains {c, d, e}
must also contain its subsets, {c, d}, {c, e}, {d, e},
{c}, {d}, and {e}. As a result, if {c, d, e} is frequent,
then all subsets of {c, d, e} (i.e., the shaded
itemsets in this figure) must also be frequent.
Frequent Itemset Generation

Conversely, if an itemset such as {a, b} is infrequent, then all of its supersets must be infrequent too. Refer the adjacent figure. The entire subgraph containing the supersets of {a, b} can be pruned immediately once {a, b} is found to be infrequent.
This strategy of trimming the exponential search
space based on the support measure is known as
support-based pruning.
Such a pruning strategy is made possible by a key
property of the support measure, namely, that the
support for an itemset never exceeds the support
for its subsets. This property is also known as the
anti-monotone property of the support measure.
Frequent Itemset Generation

Monotonicity Property
Let I be a set of items, and J = 2^I be the power set of I. A measure f is monotone (or upward closed) if
∀ X,Y ∈ J : (X ⊆ Y ) → f(X) ≤ f(Y )
which means that if X is a subset of Y , then f(X) must not exceed f(Y ).

On the other hand, f is anti-monotone (or downward closed) if


∀ X, Y ∈ J : (X ⊆ Y ) → f(Y ) ≤ f(X)
which means that if X is a subset of Y , then f(Y ) must not exceed f(X).

Any measure that possesses an anti-monotone property can be incorporated directly into the mining
algorithm to effectively prune the exponential search space of candidate itemsets.
Frequent Itemset Generation in the Apriori Algorithm
Apriori is the first association rule mining algorithm that pioneered the use of support-based pruning to
systematically control the exponential growth of candidate itemsets. Here, we assume that the support
threshold is 60%, which is equivalent to a minimum support count = 3 because there are 5 transactions.
Frequent Itemset Generation in the Apriori Algorithm
Initially, every item is considered as a candidate 1-itemset. After counting their supports, the candidate itemsets
{Cola} and {Eggs} are discarded because they appear in fewer than three transactions.
In the next iteration, candidate 2-itemsets are generated using only the frequent 1-itemsets because the Apriori
principle ensures that all supersets of the infrequent 1-itemsets must be infrequent. Because there are only four
frequent 1-itemsets, the number of candidate 2-itemsets generated by the algorithm is C(4, 2) = 6.
Two of these six candidates, {Beer, Bread} and {Beer, Milk}, are subsequently found to be infrequent after
computing their support values.
The remaining four candidates are frequent, and thus will be used to generate candidate 3-itemsets. Without
support-based pruning, there are C(6, 3) = 20 candidate 3-itemsets that can be formed using the six items given in
this example. With the Apriori principle, we only need to keep candidate 3-itemsets whose subsets are frequent.
The only candidate that has this property is {Bread, Diapers, Milk}.
The effectiveness of the Apriori pruning strategy can be shown by counting the number of candidate itemsets generated. A brute-force strategy of enumerating all itemsets (up to size 3) as candidates will produce C(6, 1) + C(6, 2) + C(6, 3) = 6 + 15 + 20 = 41 candidates.
With the Apriori principle, this number decreases to C(6, 1) + C(4, 2) + 1 = 6 + 6 + 1 = 13 candidates, which represents a 68% reduction in the number of candidate itemsets even in this simple example.
Frequent Itemset Generation in the Apriori Algorithm

Algorithm: Frequent itemset generation of the Apriori algorithm


1: k = 1.
2: Fk = { i | i ∈ I ∧ σ({i}) ≥ N × minsup} {Find all frequent 1-itemsets}
3: repeat
4: k = k + 1.
5: Ck = apriori-gen(Fk−1) {Generate candidate itemsets}
6: for each transaction t ∈ T do
7: Ct = subset(Ck, t) {Identify all candidates that belong to t}
8: for each candidate itemset c ∈ Ct do
9: σ(c) = σ(c) + 1 {Increment support count}
10: end for
11: end for
12: Fk = { c | c ∈ Ck ∧ σ(c) ≥ N × minsup} {Extract the frequent k-itemsets}
13: until Fk = ∅
14: Result = ∪ Fk
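A compact Python rendering of this pseudocode is given below. It is a sketch rather than a faithful line-by-line implementation: candidate generation here simply merges pairs of frequent (k−1)-itemsets and prunes candidates with an infrequent subset, and transactions are assumed to be Python sets.

from collections import defaultdict
from itertools import combinations

def apriori(transactions, minsup):
    n = len(transactions)
    min_count = n * minsup
    # steps 1-2: find all frequent 1-itemsets, F1
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[frozenset([item])] += 1
    fk = {c: s for c, s in counts.items() if s >= min_count}
    result = dict(fk)
    k = 1
    while fk:
        k += 1
        # step 5: generate candidate k-itemsets from the frequent (k-1)-itemsets
        prev = list(fk)
        candidates = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                union = prev[i] | prev[j]
                if len(union) == k and all(
                        frozenset(s) in fk for s in combinations(union, k - 1)):
                    candidates.add(union)
        # steps 6-11: one pass over the data to count candidate supports
        counts = defaultdict(int)
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        # step 12: extract the frequent k-itemsets
        fk = {c: s for c, s in counts.items() if s >= min_count}
        result.update(fk)
    return result

For example, apriori(transactions, 0.6) on the five example transactions returns the four frequent 1-itemsets and the four frequent 2-itemsets identified in the worked example (the lone candidate 3-itemset {Bread, Diapers, Milk} turns out to be infrequent at this threshold).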
Frequent Itemset Generation in the Apriori Algorithm

Explanation of Apriori algorithm


 The algorithm initially makes a single pass over the data set to determine the support of each item.
Upon completion of this step, the set of all frequent 1-itemsets, F1, will be known (steps 1 and 2).
 Next, the algorithm will iteratively generate new candidate k-itemsets using the frequent (k − 1)-
itemsets found in the previous iteration (step 5).
 To count the support of the candidates, the algorithm needs to make an additional pass over the data
set (steps 6–10). The subset function is used to determine all the candidate itemsets in Ck that are
contained in each transaction t.
 After counting their supports, the algorithm eliminates all candidate itemsets whose support counts are
less than minsup (step 12).
 The algorithm terminates when there are no new frequent itemsets generated, i.e., Fk = ∅ (step 13).
Frequent Itemset Generation in the Apriori Algorithm

The frequent itemset generation part of the Apriori algorithm has two important characteristics.
 It is a level-wise algorithm; i.e., it traverses the itemset lattice one level at a time, from frequent 1-
itemsets to the maximum size of frequent itemsets.
 It employs a generate-and-test strategy for finding frequent itemsets. At each iteration, new candidate
itemsets are generated from the frequent itemsets found in the previous iteration. The support for each
candidate is then counted and tested against the minsup threshold. The total number of iterations
needed by the algorithm is kmax+1, where kmax is the maximum size of the frequent itemsets.
Frequent Itemset Generation in the Apriori Algorithm
Candidate Generation and Pruning
The apriori-gen() (step 5 in Apriori Algorithm) generates candidate itemsets by performing the following
two operations.
1. Candidate Generation: This operation generates new candidate k-itemsets based on the frequent (k −
1)-itemsets found in the previous iteration.
2. Candidate Pruning: This operation eliminates some of the candidate k-itemsets using the support-
based pruning strategy.

Candidate pruning operation


Consider a candidate k-itemset, X = {i1, i2, . . . , ik}. The algorithm must determine whether all of its proper
subsets, X − {ij} (∀j = 1, 2, . . . , k), are frequent. If one of them is infrequent, then X is immediately pruned.
This approach can effectively reduce the number of candidate itemsets considered during support
counting. The complexity of this operation is O(k) for each candidate k-itemset. However, we do not need
to examine all k subsets of a given candidate itemset. If m of the k subsets, where m < k, were used to
generate a candidate, we only need to check the remaining k −m subsets during candidate pruning.
Frequent Itemset Generation in the Apriori Algorithm
Candidate Generation
There are many ways to generate candidate itemsets. The following is a list of requirements for an effective
candidate generation procedure:
 It should avoid generating too many unnecessary candidates. A candidate itemset is unnecessary if at
least one of its subsets is infrequent. Such a candidate is guaranteed to be infrequent according to the
anti-monotone property of support.
 It must ensure that the candidate set is complete, i.e., no frequent itemsets are left out by the
candidate generation procedure. To ensure completeness, the set of candidate itemsets must contain
the set of all frequent itemsets, i.e., ∀ k : Fk ⊆ Ck.
 It should not generate the same candidate itemset more than once. For example, the candidate
itemset {a, b, c, d} can be generated in many ways—by merging {a, b, c} with {d}, {b, d} with {a, c}, {c}
with {a, b, d}, etc. Generation of duplicate candidates leads to wasted computations and thus should be
avoided for efficiency reasons.
Frequent Itemset Generation in the Apriori Algorithm
Candidate Generation :: Brute-Force Method
It considers every k-itemset as a potential candidate and then applies the candidate pruning step to remove any unnecessary candidates (refer the figure for an example). The number of candidate itemsets generated at level k is equal to C(d, k), where d is the total number of items.
 Although candidate generation is rather trivial, candidate pruning becomes extremely expensive because a large number of itemsets must be examined.
 Given that the amount of computation needed for each candidate is O(k), the overall complexity of this method is O(Σ_{k=1..d} k · C(d, k)) = O(d · 2^(d−1)).
Frequent Itemset Generation in the Apriori Algorithm
Candidate Generation :: Fk−1 × F1 Method
It generates candidate k-itemset by extending each frequent (k − 1)-itemset with other frequent items. The
figure below shows how a frequent 2-itemset such as {Beer, Diapers} can be augmented with a frequent
item such as Bread to produce a candidate 3-itemset {Beer, Diapers, Bread}.

This method will produce O(|Fk−1| × |F1|) candidate k-itemsets, where |Fj| is the number of frequent j-itemsets. The overall complexity of this step is O(Σ_k k · |Fk−1| · |F1|).
Frequent Itemset Generation in the Apriori Algorithm
Candidate Generation :: Fk−1 × F1 Method
The procedure is complete because every frequent k-itemset is composed of a frequent (k − 1)-itemset and a
frequent 1-itemset. Therefore, all frequent k-itemsets are part of the candidate k-itemsets generated by this
procedure.
However, it does not prevent the same candidate itemset from being generated more than once. For instance,
{Bread, Diapers, Milk} can be generated by merging {Bread, Diapers} with {Milk}, {Bread, Milk} with {Diapers}, or
{Diapers, Milk} with {Bread}. One way to avoid generating duplicate candidates is by ensuring that the items in
each frequent itemset are kept sorted in their lexicographic order. Each frequent (k−1)-itemset X is then
extended with frequent items that are lexicographically larger than the items in X. For example, the itemset
{Bread, Diapers} can be augmented with {Milk} since Milk is lexicographically larger than Bread and Diapers.
However, we should not augment {Diapers, Milk} with {Bread} nor {Bread, Milk} with {Diapers} because they
violate the lexicographic ordering condition.
While this procedure is a substantial improvement over the brute-force method, it can still produce a large
number of unnecessary candidates. For example, the candidate itemset obtained by merging {Beer, Diapers}
with {Milk} is unnecessary because one of its subsets, {Beer, Milk}, is infrequent. However, several heuristics are
available to reduce the number of unnecessary candidates. For example, note that, for every candidate k-
itemset that survives the pruning step, every item in the candidate must be contained in at least k −1 of the
frequent (k −1)-itemsets. Otherwise, the candidate is guaranteed to be infrequent.
Frequent Itemset Generation in the Apriori Algorithm
Candidate Generation :: Fk−1 × Fk-1 Method
The candidate generation procedure in the apriori-gen() merges a pair of frequent (k−1)-itemsets only if their
first k−2 items are identical. Let A = {a1, a2, . . . , ak−1} and B = {b1, b2, . . . , bk−1} be a pair of frequent (k − 1)-
itemsets. A and B are merged if they satisfy the following conditions.
ai = bi (for i = 1, 2, . . . , k − 2) and ak−1 ≠ bk−1
In the adjacent figure, the frequent itemsets {Bread, Diapers}
and {Bread, Milk} are merged to form a candidate 3-itemset
{Bread, Diapers, Milk}. The algorithm does not have to merge
{Beer, Diapers} with {Diapers, Milk} because the first item in
both itemsets is different. Indeed, if {Beer, Diapers, Milk} is a
viable candidate, it would have been obtained by merging {Beer,
Diapers} with {Beer, Milk} instead.
This example illustrates both the completeness of the candidate
generation procedure and the advantages of using lexicographic
ordering to prevent duplicate candidates. However, because
each candidate is obtained by merging a pair of frequent (k−1)-
itemsets, an additional candidate pruning step is needed to
ensure that the remaining k−2 subsets of the candidate are
frequent.
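A small Python sketch of this Fk−1 × Fk−1 merge-and-prune step is shown below (an illustration, assuming each frequent (k−1)-itemset is stored as a lexicographically sorted tuple).

from itertools import combinations

def apriori_gen(freq_k_minus_1):
    # freq_k_minus_1: a set of sorted tuples, each of length k-1
    freq = set(freq_k_minus_1)
    if not freq:
        return []
    k = len(next(iter(freq))) + 1
    candidates = []
    items = sorted(freq)
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            a, b = items[i], items[j]
            # merge only if the first k-2 items are identical
            if a[:-1] == b[:-1]:
                cand = a + (b[-1],)
                # candidate pruning: every (k-1)-subset must be frequent
                if all(sub in freq for sub in combinations(cand, k - 1)):
                    candidates.append(cand)
    return candidates

# e.g., with the frequent 2-itemsets of the running example:
f2 = {("Bread", "Diapers"), ("Bread", "Milk"),
      ("Diapers", "Milk"), ("Beer", "Diapers")}
print(apriori_gen(f2))   # [('Bread', 'Diapers', 'Milk')]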
Frequent Itemset Generation in the Apriori Algorithm

Support Counting
Support counting (implemented in steps 6 - 11 in the algorithm) is the process of determining the
frequency of occurrence for every candidate itemset that survives the candidate pruning step of the
apriori-gen().
One approach for doing this is to compare each transaction against every candidate itemset and to update
the support counts of candidates contained in the transaction. This approach is computationally expensive,
especially when the numbers of transactions and candidate itemsets are large.
An alternative approach is to enumerate the itemsets contained in each transaction and use them to
update the support counts of their respective candidate itemsets. To illustrate, consider a transaction t that
contains five items, {1, 2, 3, 5, 6}. There are C(5, 3) = 10 itemsets of size 3 contained in this transaction.
Frequent Itemset Generation in the Apriori Algorithm
Support Counting
The adjacent figure shows a systematic way for enumerating the 3-itemsets contained in t.
Assuming that each itemset keeps its items in increasing lexicographic order, an itemset can be enumerated by specifying its smallest item first, followed by the larger items.
The number of ways to specify the first item of a 3-itemset contained in t is illustrated by the Level 1 prefix
structures.
After fixing the first item, the prefix structures at Level 2
represent the number of ways to select the second item.
Finally, the prefix structures at Level 3 represent the complete
set of 3-itemsets contained in t.
The prefix structures demonstrate how itemsets contained in
a transaction can be systematically enumerated, i.e., by
specifying their items one by one, from the leftmost item to
the rightmost item.
However, we still have to determine whether each
enumerated 3-itemset corresponds to an existing candidate
itemset. If it matches one of the candidates, then the support
count of the corresponding candidate is incremented.
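The enumeration itself is what itertools.combinations provides in Python; the short sketch below (the candidate set is hypothetical) shows how the enumerated subsets are matched against candidates.

from itertools import combinations

def count_by_enumeration(transaction, candidate_counts, k):
    # candidate_counts maps sorted candidate tuples to their current counts
    for subset in combinations(sorted(transaction), k):
        if subset in candidate_counts:
            candidate_counts[subset] += 1

t = {1, 2, 3, 5, 6}
counts = {(1, 2, 3): 0, (1, 3, 6): 0, (2, 4, 5): 0}   # hypothetical candidate 3-itemsets
count_by_enumeration(t, counts, 3)
print(counts)   # {(1, 2, 3): 1, (1, 3, 6): 1, (2, 4, 5): 0}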
Frequent Itemset Generation in the Apriori Algorithm
Support Counting Using a Hash Tree
The candidate itemsets are partitioned into different buckets and stored in a hash tree. During support
counting, itemsets contained in each transaction are also hashed into their appropriate buckets. This way,
instead of comparing each itemset in the transaction with every candidate itemset, it is matched only
against candidate itemsets that belong to the same bucket, as shown in the figure below.
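The sketch below illustrates the bucketing idea in a much-simplified form: candidates are hashed into a fixed number of buckets, and each enumerated k-subset of a transaction is compared only with candidates in its own bucket. A real hash tree hashes the items of an itemset level by level, but the effect on the number of comparisons is the same in spirit.

from collections import defaultdict
from itertools import combinations

N_BUCKETS = 7   # illustrative choice

def bucket_of(itemset):
    return hash(itemset) % N_BUCKETS

def count_supports(transactions, candidates, k):
    # candidates: collection of sorted k-tuples
    buckets = defaultdict(list)
    for c in candidates:
        buckets[bucket_of(c)].append(c)
    counts = {c: 0 for c in candidates}
    for t in transactions:
        for subset in combinations(sorted(t), k):
            # compare only against candidates that hash to the same bucket
            for cand in buckets.get(bucket_of(subset), ()):
                if cand == subset:
                    counts[cand] += 1
    return counts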
Frequent Itemset Generation in the Apriori Algorithm
Computational Complexity of the Apriori algorithm can be affected by
the following factors.
Support Threshold: Lowering the support threshold often results in more
itemsets being declared as frequent. This has an adverse effect on the
computational complexity of the algorithm because more candidate
itemsets must be generated and counted (refer figure (a)). The maximum
size of frequent itemsets also tends to increase with lower support
thresholds (refer figure (b)). As the maximum size of the frequent
itemsets increases, the algorithm will need to make more passes over
the data set.
Number of Items (Dimensionality): As the number of items increases,
more space will be needed to store the support counts of items. If the
number of frequent items also grows with the dimensionality of the data,
the computation and I/O costs will increase because of the larger
number of candidate itemsets generated by the algorithm.
Number of Transactions: Since the Apriori algorithm makes repeated
passes over the data set, its run time increases with a larger number of
transactions.
Frequent Itemset Generation in the Apriori Algorithm

Average Transaction Width: For dense data sets, the average transaction
width can be very large. This affects the complexity of the Apriori
algorithm in two ways.
 The maximum size of frequent itemsets tends to increase as the
average transaction width increases. As a result, more candidate
itemsets must be examined during candidate generation and support
counting (refer figure).
 As the transaction width increases, more itemsets are contained in
the transaction. This will increase the number of hash tree traversals
performed during support counting.
Rule Generation
Association rules are extracted from a given frequent itemset. Each frequent k-itemset, Y, can produce up to 2^k − 2 association rules, ignoring rules that have empty antecedents or consequents (∅ → Y or Y → ∅).
An association rule can be extracted by partitioning the itemset Y into two non-empty subsets, X and Y −X,
such that X → Y −X satisfies the confidence threshold.
All such rules must have already met the support threshold because they are generated from a frequent
itemset.

Example: Let Y = {1, 2, 3} be a frequent itemset. There are six candidate association rules that can be
generated from Y: {1, 2} → {3}, {1, 3} → {2}, {2, 3} → {1}, {1} → {2, 3}, {2} → {1, 3}, and {3} → {1, 2}. As the support of each rule is identical to the support for Y, the rules must satisfy the support threshold.

Computing the confidence of an association rule does not require additional scans of the transaction data
set.
Consider the rule {1, 2} → {3}, which is generated from the frequent itemset {1, 2, 3}. The confidence for
this rule is σ({1, 2, 3})/σ({1, 2}). Because {1, 2, 3} is frequent, the anti-monotone property of support
ensures that {1, 2} must be frequent, too. Since the support counts for both itemsets were already found
during frequent itemset generation, there is no need to read the entire data set again.
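In code, this amounts to two dictionary lookups into the support counts already gathered during frequent itemset generation; the sketch below uses hypothetical counts purely for illustration.

support = {frozenset({1, 2, 3}): 2, frozenset({1, 2}): 3}   # hypothetical counts kept from itemset generation

def confidence(antecedent, consequent, support):
    # no data scan: both counts were recorded while generating frequent itemsets
    itemset = frozenset(antecedent) | frozenset(consequent)
    return support[itemset] / support[frozenset(antecedent)]

print(confidence({1, 2}, {3}, support))   # 0.666...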
Rule Generation

Confidence-Based Pruning
Unlike the support measure, confidence does not have any monotone property.
For example, the confidence for X → Y can be larger than, smaller than, or equal to the confidence for another rule X̃ → Ỹ, where X̃ ⊆ X and Ỹ ⊆ Y.

Nevertheless, if we compare rules generated from the same frequent itemset Y, the following theorem
holds for the confidence measure.
Theorem: If a rule X → Y−X does not satisfy the confidence threshold, then any rule X’ → Y−X’, where X’ is a
subset of X, must not satisfy the confidence threshold as well.
Proof: Consider the following two rules: X’ → Y −X’ and X → Y−X, where X’ ⊂ X. The confidence of the rules
are σ(Y )/σ(X’) and σ(Y )/σ(X), respectively. Since X’ is a subset of X, σ(X’) ≥ σ(X). Therefore, the former rule
cannot have a higher confidence than the latter rule.
Rule Generation
Rule Generation in Apriori Algorithm
The Apriori algorithm uses a level-wise approach for generating association rules, where each level corresponds
to the number of items that belong to the rule consequent.
Initially, all the high-confidence rules that have only one item in the rule consequent are extracted. These rules
are then used to generate new candidate rules.
For example, if {acd} → {b} and {abd} → {c} are high-
confidence rules, then the candidate rule {ad} → {bc}
is generated by merging the consequents of both
rules.
The adjacent figure shows a lattice structure for the
association rules generated from the frequent itemset {a,
b, c, d}. If any node in the lattice has low confidence, then
according to theorem (refer previous slide), the entire
subgraph spanned by the node can be pruned
immediately. Suppose the confidence for {bcd} → {a} is
low. All the rules containing item a in its consequent,
including {cd} → {ab}, {bd} → {ac}, {bc} → {ad}, and {d} →
{abc} can be discarded.
Rule Generation
Rule Generation in Apriori Algorithm
Algorithm: Rule generation of the Apriori algorithm
1: for each frequent k-itemset fk, k ≥ 2 do
2:   H1 = {i | i ∈ fk} {1-item consequents of the rule}
3:   call ap-genrules(fk, H1)
4: end for

Algorithm: Procedure ap-genrules(fk, Hm)
1: k = |fk| {size of frequent itemset}
2: m = |Hm| {size of rule consequent}
3: if k > m + 1 then
4:   Hm+1 = apriori-gen(Hm)
5:   for each hm+1 ∈ Hm+1 do
6:     conf = σ(fk)/σ(fk − hm+1)
7:     if conf ≥ minconf then
8:       output the rule (fk − hm+1) → hm+1
9:     else
10:      delete hm+1 from Hm+1
11:    end if
12:  end for
13:  call ap-genrules(fk, Hm+1)
14: end if
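A Python sketch of this level-wise rule generation is given below. It assumes a dictionary mapping every frequent itemset (as a frozenset) to its support count, such as the one produced by the earlier Apriori sketch; consequents grow by one item per level and low-confidence consequents are pruned, as in ap-genrules, though this is not a line-by-line transcription of the pseudocode.

def generate_rules(support, minconf):
    # support: dict mapping frozenset itemsets to support counts
    rules = []
    for itemset, sup in support.items():
        k = len(itemset)
        if k < 2:
            continue
        consequents = [frozenset([i]) for i in itemset]   # H1: 1-item consequents
        while consequents:
            m = len(consequents[0])
            survivors = []
            for h in consequents:
                conf = sup / support[itemset - h]
                if conf >= minconf:
                    rules.append((itemset - h, h, conf))
                    survivors.append(h)
                # low-confidence consequents are dropped (confidence-based pruning theorem)
            if m + 1 >= k:
                break   # a larger consequent would leave an empty antecedent
            # merge surviving m-item consequents into (m+1)-item consequents
            consequents = list({a | b for a in survivors for b in survivors
                                if len(a | b) == m + 1})
    return rules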
Compact Representation of Frequent Itemsets
In practice, the number of frequent itemsets produced from a transaction data set can be very large. Therefore,
it is useful to identify a small representative set of itemsets from which all other frequent itemsets can be
derived. Two such representations are maximal and closed frequent itemsets.
Maximal Frequent Itemsets: A maximal frequent itemset is a frequent itemset for which none of its immediate
supersets are frequent.
Consider the itemset lattice shown in the adjacent
figure. The itemsets in the lattice are divided into two
groups: (i) those that are frequent and (ii) those that
are infrequent. Every itemset located above the border
is frequent, while those located below the border (the
shaded nodes) are infrequent. The border itemsets {a,
d}, {a, c, e}, and {b, c, d, e} are maximal frequent
itemsets because their immediate supersets are
infrequent.
For example, the itemset {a, d} is maximal frequent
because all of its immediate supersets, {a, b, d}, {a,
c, d}, and {a, d, e}, are infrequent whereas {a, c} is
non-maximal because one of its immediate
supersets, {a, c, e}, is frequent.
Compact Representation of Frequent Itemsets
Maximal frequent itemsets provide an effective compact representation of frequent itemsets. They form
the smallest set of itemsets from which all frequent itemsets can be derived.
For example, the frequent itemsets shown in figure in previous slide can be divided into two groups:
 Frequent itemsets that begin with item a and that may contain items c, d, or e. This group includes
itemsets such as {a}, {a, c}, {a, d}, {a, e}, and {a, c, e}.
 Frequent itemsets that begin with items b, c, d, or e. This group includes itemsets such as {b}, {b, c}, {c,
d}, {b, c, d, e}, etc.
Frequent itemsets that belong in the first group are subsets of either {a, c, e} or {a, d}, while those that
belong in the second group are subsets of {b, c, d, e}. Hence, the maximal frequent itemsets {a, c, e}, {a, d},
and {b, c, d, e} provide a compact representation of the frequent itemsets shown in the figure.
Maximal frequent itemsets provide a valuable representation for data sets that can produce very long frequent itemsets, as there are exponentially many frequent itemsets in such data.
However, this approach is practical only if an efficient algorithm exists to explicitly find the maximal
frequent itemsets without having to enumerate all their subsets.
Compact Representation of Frequent Itemsets

Despite providing a compact representation, maximal frequent itemsets do not contain the support
information of their subsets.
For example, the support of the maximal frequent itemsets {a, c, e}, {a, d}, and {b,c,d,e} do not provide
any hint about the support of their subsets. An additional pass over the data set is therefore needed to
determine the support counts of the non-maximal frequent itemsets.
Closed Frequent Itemsets
In some cases, it might be desirable to have a minimal representation of frequent itemsets that preserves
the support information.
Closed itemsets provide a minimal representation of itemsets without losing their support information.
Definition (closed itemset): An itemset X is closed if none of its immediate supersets has exactly the same
support count as X. In other words, X is not closed if at least one of its immediate supersets has the same
support count as X.
Compact Representation of Frequent Itemsets
Examples of closed itemsets are shown in the
adjacent figure.
Each node (itemset) in the lattice is associated with a
list of its corresponding transaction IDs to better
illustrate its support count. For example, since the
node {b, c} is associated with transaction IDs 1, 2,
and 3, its support count is equal to three.
From the transactions given in this diagram, we see
that every transaction that contains b also contains
c. Consequently, the support for {b} is identical to {b,
c} and {b} should not be considered a closed itemset.
Similarly, since c occurs in every transaction that
contains both a and d, the itemset {a, d} is not
closed.
On the other hand, {b, c} is a closed itemset because
it does not have the same support count as any of its
supersets.
Compact Representation of Frequent Itemsets
Definition (Closed Frequent Itemset). An itemset is a closed frequent itemset if it is closed and its support is
greater than or equal to minsup.
In the previous example (previous slide), assuming that the support threshold is 40%, {b,c} is a closed frequent
itemset because its support is 60%. The rest of the closed frequent itemsets are indicated by the shaded
nodes.
Algorithms are available to explicitly extract closed frequent itemsets from a given data set.
We can use the closed frequent itemsets to determine the support counts for the non-closed frequent
itemsets. For example, consider the frequent itemset {a, d} shown in the figure (previous slide). Because the
itemset is not closed, its support count must be identical to one of its immediate supersets.
The key is to determine which superset (among {a, b, d}, {a, c, d}, or {a, d, e}) has exactly the same support count
as {a, d}. The Apriori principle states that any transaction that contains the superset of {a, d} must also contain {a,
d}. However, any transaction that contains {a, d} does not have to contain the supersets of {a, d}. For this reason,
the support for {a, d} must be equal to the largest support among its supersets. Since {a, c, d} has a larger
support than both {a, b, d} and {a, d, e}, the support for {a, d} must be identical to the support for {a, c, d}.
Using this methodology, an algorithm can be developed to compute the support for the non-closed frequent
itemsets.
Compact Representation of Frequent Itemsets
The algorithm proceeds in a specific-to-general fashion, i.e., from the largest to the smallest frequent
itemsets. This is because, in order to find the support for a non-closed frequent itemset, the support for all
of its supersets must be known.

Algorithm: Support counting using closed frequent itemsets


1: Let C denote the set of closed frequent itemsets
2: Let kmax denote the maximum size of closed frequent itemsets
3: Fkmax = {f | f ∈ C, |f| = kmax} {Find all frequent itemsets of size kmax.}
4: for k = kmax − 1 downto 1 do
5: Fk = {f|f ⊂ Fk+1, |f| = k} {Find all frequent itemsets of size k.}
6: for each f ∈ Fk do
7: if f ∉ C then
8: f.support = max{f’.support|f’ ∈ Fk+1, f ⊂ f’}
9: end if
10: end for
11: end for
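The following Python sketch mirrors this procedure, with one small completion: at each size it also adds the closed itemsets of that size to the current level, so that frequent itemsets that are not subsets of any larger frequent itemset are not missed. It assumes a dictionary of closed frequent itemsets with their supports.

from itertools import combinations

def supports_from_closed(closed_support):
    # closed_support: dict mapping frozenset closed frequent itemsets to their supports
    support = dict(closed_support)
    kmax = max(len(c) for c in closed_support)
    level = {c for c in closed_support if len(c) == kmax}   # frequent itemsets of size kmax
    for k in range(kmax - 1, 0, -1):
        # frequent k-itemsets: k-subsets of the level above, plus closed itemsets of size k
        next_level = {frozenset(s) for f in level for s in combinations(f, k)}
        next_level |= {c for c in closed_support if len(c) == k}
        for f in next_level:
            if f not in support:   # f is frequent but not closed
                support[f] = max(support[g] for g in level if f < g)
        level = next_level
    return support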
Compact Representation of Frequent Itemsets
Advantage of using closed frequent itemsets
Consider the data set shown in the table below, which contains ten transactions and fifteen items. The
items can be divided into three groups:
(1) Group A, which contains items a1 through a5;
(2) Group B, which contains items b1 through b5; and
(3) Group C, which contains items c1 through c5.

Items within each group are perfectly associated with each other and they do not appear with items from another group. Assuming the support threshold is 20%, the total number of frequent itemsets is 3 × (2^5 − 1) = 93. However, there are only three closed frequent itemsets in the data: {a1, a2, a3, a4, a5}, {b1, b2, b3, b4, b5}, and {c1, c2, c3, c4, c5}. It is often sufficient to present only the closed frequent itemsets to the analysts instead of the entire set of frequent itemsets.

TID  a1 a2 a3 a4 a5 b1 b2 b3 b4 b5 c1 c2 c3 c4 c5
1    1  1  1  1  1  0  0  0  0  0  0  0  0  0  0
2    1  1  1  1  1  0  0  0  0  0  0  0  0  0  0
3    1  1  1  1  1  0  0  0  0  0  0  0  0  0  0
4    0  0  0  0  0  1  1  1  1  1  0  0  0  0  0
5    0  0  0  0  0  1  1  1  1  1  0  0  0  0  0
6    0  0  0  0  0  1  1  1  1  1  0  0  0  0  0
7    0  0  0  0  0  0  0  0  0  0  1  1  1  1  1
8    0  0  0  0  0  0  0  0  0  0  1  1  1  1  1
9    0  0  0  0  0  0  0  0  0  0  1  1  1  1  1
10   0  0  0  0  0  0  0  0  0  0  1  1  1  1  1
instead of the entire set of frequent itemsets.
Compact Representation of Frequent Itemsets
All maximal frequent itemsets are closed because none of the maximal frequent itemsets can have the
same support count as their immediate supersets. The relationships among frequent, maximal frequent,
and closed frequent itemsets are shown in the figure below.
Compact Representation of Frequent Itemsets

Closed frequent itemsets are useful for removing some of the redundant association rules.
An association rule X → Y is redundant if there exists another rule X’ → Y’, where X is a subset of X’ and
Y is a subset of Y’, such that the support and confidence for both rules are identical.

In the example shown in slide 38, {b} is not a closed frequent itemset while {b, c} is closed. The association
rule {b} → {d, e} is therefore redundant because it has the same support and confidence as {b, c} → {d, e}.
Such redundant rules are not generated if closed frequent itemsets are used for rule generation.
Mining Frequent Itemsets Using the Vertical Data Format
There are many ways to represent a transaction data set. However, the choice of representation can affect
the I/O costs incurred when computing the support of candidate itemsets.
The adjacent figure shows two different ways of representing market basket transactions. The first
representation is called a horizontal data layout (TID-itemset format (i.e., {TID : itemset}), where TID is a
transaction ID and itemset is the set of items bought in transaction TID). It is adopted by many association
rule mining algorithms, including Apriori.
Another possibility is to store the list of transaction identifiers
associated with each item (item-TID set format (i.e., {item : TID
set}), where item is an item name, and TID set is the set of
transaction identifiers containing the item). Such a
representation is known as the vertical data layout.
The support for each candidate itemset is obtained by
intersecting the TID-sets of its subset items. The length of the
TID-sets shrinks as we progress to larger sized itemsets.
However, one problem with this approach is that the initial set of
TID-sets may be too large to fit into main memory, thus requiring
more sophisticated techniques to compress the TID-sets.
Mining Frequent Itemsets Using the Vertical Data Format

Basic Working Principle


First, we transform the horizontally formatted data into the vertical format by scanning the data set once.
The support count of an itemset is simply the length of the TID-set of the itemset.
Starting with k = 1, the frequent k-itemsets can be used to construct the candidate (k+1)-itemsets based on
the Apriori property. The computation is done by intersection of the TID-sets of the frequent k-itemsets to
compute the TID-sets of the corresponding (k+1)-itemsets.
This process repeats, with k incremented by 1 each time, until no frequent itemsets or candidate itemsets
can be found.
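A short Python sketch of this vertical-format (TID-set intersection) mining loop, assuming the data has already been converted to a dictionary that maps each 1-itemset to its TID-set:

from itertools import combinations

def vertical_mine(tidsets, min_count):
    # tidsets: dict mapping frozenset 1-itemsets to sets of transaction IDs
    frequent = {i: t for i, t in tidsets.items() if len(t) >= min_count}
    result = dict(frequent)
    k = 1
    while frequent:
        k += 1
        candidates = {}
        for a, b in combinations(list(frequent), 2):
            union = a | b
            if len(union) == k and union not in candidates:
                # the TID-set of the union is the intersection of the two TID-sets
                candidates[union] = frequent[a] & frequent[b]
        frequent = {c: t for c, t in candidates.items() if len(t) >= min_count}
        result.update(frequent)
    return result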
Mining Frequent Itemsets Using the Vertical Data Format
Example: Consider a market basket transaction database, D, into the vertical data format as shown in Table 1
below. (A horizontal data format can be transformed into vertical data format by scanning it once.)
Mining can be performed on this data set by intersecting the TID sets of every pair of frequent single items.
Suppose the minimum support count is 2. Because every single item is frequent in Table 1, there are 10
intersections performed in total, which lead to eight nonempty 2-itemsets, as shown in Table 2. Because the
itemsets {I1, I4} and {I3, I5} each contain only one transaction, they do not belong to the set of frequent 2-
itemsets.
Based on the Apriori property, a given 3-itemset is a candidate 3-itemset only if every one of its 2-itemset
subsets is frequent. The candidate generation process here will generate only two 3-itemsets: {I1, I2, I3} and {I1,
I2, I5}. By intersecting the TID sets of any two corresponding 2-itemsets of these candidate 3-itemsets, it derives
Table 3, where there are only two frequent 3-itemsets: {I1, I2, I3: 2} and {I1, I2, I5: 2}.
Table 1: The Vertical Data Format of the Transaction Data Set D
Itemset  TID-Set
I1       {T1, T4, T5, T7, T8, T9}
I2       {T1, T2, T3, T4, T6, T8, T9}
I3       {T3, T5, T6, T7, T8, T9}
I4       {T2, T4}
I5       {T1, T8}

Table 2: 2-Itemsets in the Vertical Data Format
Itemset   TID-Set
{I1, I2}  {T1, T4, T8, T9}
{I1, I3}  {T5, T7, T8, T9}
{I1, I4}  {T4}
{I1, I5}  {T1, T8}
{I2, I3}  {T3, T6, T8, T9}
{I2, I4}  {T2, T4}
{I2, I5}  {T1, T8}
{I3, I5}  {T8}

Table 3: 3-Itemsets in the Vertical Data Format
Itemset       TID-Set
{I1, I2, I3}  {T8, T9}
{I1, I2, I5}  {T1, T8}
Mining Frequent Itemsets Using the Vertical Data Format
Advantage 1: It uses the Apriori property in the generation of candidate (k+1)-itemsets from frequent k-itemsets.
Advantage 2: There is no need to scan the database to find the support of (k+1)-itemsets (for k ≥ 1). This is because the TID-set of each k-itemset carries the complete information required for counting such support.
Disadvantage: The TID-sets can be quite long, taking substantial memory space as well as computation
time for intersecting the long sets.
Solution: To reduce the cost of registering long TID-sets, as well as the subsequent costs of intersections,
we can use a technique called diffset, which keeps track of only the differences of the TID-sets of a
(k+1)-itemset and a corresponding k-itemset.
For example, we have {I1} = {T1, T4, T5, T7, T8, T9} and {I1, I2} = {T1, T4, T8, T9}. The diffset
between the two is diffset({I1, I2}, {I1}) = {T5, T7}. Thus, rather than recording the four TIDs that
make up the intersection of {I1} and {I2}, we can instead use diffset to record just two TIDs,
indicating the difference between {I1} and {I1, I2}.
Experiments show that in certain situations, such as when the data set contains many dense and long
patterns, this technique can substantially reduce the total cost of vertical format mining of frequent
itemsets.
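A two-line illustration of the diffset idea, using the TID-sets from the example above:

tid_I1 = {"T1", "T4", "T5", "T7", "T8", "T9"}
tid_I1_I2 = {"T1", "T4", "T8", "T9"}

diffset = tid_I1 - tid_I1_I2                    # {'T5', 'T7'}: the only TIDs that need storing
support_I1_I2 = len(tid_I1) - len(diffset)      # 4, recovered without the full TID-set
print(sorted(diffset), support_I1_I2)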
Mining Frequent Itemsets Using FP-Growth Algorithm
The FP-growth algorithm does not subscribe to the generate-and-test paradigm of Apriori. Instead, it
encodes the data set using a compact data structure called an FP-tree and extracts frequent itemsets
directly from this structure.

FP-Tree Representation
An FP-tree is a compressed representation of the input data.
It is constructed by reading the data set one transaction at a time and mapping each transaction onto a
path in the FP-tree.
As different transactions can have several items in common, their paths may overlap.
The more the paths overlap with one another, the more compression we can achieve using the FP-tree
structure.
If the size of the FP-tree is small enough to fit into main memory, we can extract frequent itemsets directly
from the structure in memory instead of making repeated passes over the data stored on disk.
Each node in the tree contains the label of an item along with a counter that shows the number of
transactions mapped onto the given path.
Mining Frequent Itemsets Using FP-Growth Algorithm
Initially, the FP-tree contains only the root node represented by the null symbol. The FP-tree is subsequently
extended in the following way:
1. The data set is scanned once to determine the support count of each item. Infrequent items are discarded,
while the frequent items are sorted in decreasing support counts. In our example, a is the most frequent
item, followed by b, c, d, and e.
2. The algorithm makes a second pass over the data to construct the FP-tree. After reading the first
transaction, {a, b}, the nodes labeled as a and b are created. A path is then formed from null → a → b to
encode the transaction. Every node along the path has a frequency count of 1.
3. After reading the second transaction, {b, c, d}, a new set of nodes is created for items b, c, and d. A path is
then formed to represent the transaction by connecting the nodes null → b → c → d. Every node along this
path also has a frequency count equal to one. Although the first two transactions have an item in common,
which is b, their paths are disjoint because the transactions do not share a common prefix.
4. The third transaction, {a, c, d, e}, shares a common prefix item (which is a) with the first transaction. As a
result, the path for the third transaction, null → a → c → d → e, overlaps with the path for the first
transaction, null → a → b. Because of their overlapping path, the frequency count for node a is incremented
to two, while the frequency counts for the newly created nodes, c, d, and e, are equal to one.
5. This process continues until every transaction has been mapped onto one of the paths given in the FP-tree.
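A minimal Python sketch of the construction described above (structure only: the header table with its node pointers, and the mining step, are omitted; transactions are assumed to be already filtered to frequent items and sorted in decreasing support order).

class FPNode:
    def __init__(self, item, parent):
        self.item = item          # item label (None for the root)
        self.count = 1            # number of transactions mapped through this node
        self.parent = parent
        self.children = {}        # item -> FPNode

def build_fp_tree(ordered_transactions):
    root = FPNode(None, None)
    root.count = 0
    for t in ordered_transactions:    # second pass over the data
        node = root
        for item in t:
            if item in node.children:
                node.children[item].count += 1            # shared prefix: bump the count
            else:
                node.children[item] = FPNode(item, node)  # extend a new path
            node = node.children[item]
    return root

# the first two example transactions produce two disjoint paths, as described above
tree = build_fp_tree([["a", "b"], ["b", "c", "d"]])
print(list(tree.children))   # ['a', 'b']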
Mining Frequent Itemsets Using FP-Growth Algorithm
The size of an FP-tree is typically smaller than the size of the uncompressed data because many
transactions in market basket data often share a few items in common.
In the best-case scenario, where all the transactions have the same set of items, the FP-tree contains only a
single branch of nodes.
The worst-case scenario happens when every transaction has a unique set of items. As none of the
transactions have any items in common, the size of the FP-tree is effectively the same as the size of the
original data. However, the physical storage requirement for the FP-tree is higher because it requires
additional space to store pointers between nodes and counters for each item.

The size of an FP-tree also depends on how the items are ordered. If the ordering scheme in the preceding example is reversed, i.e., from lowest to highest support item, the resulting FP-tree is shown in the adjacent figure. The tree appears to be denser because the branching factor at the root node has increased from 2 to 5 and the number of nodes containing the high support items such as a and b has increased from 3 to 12.
Mining Frequent Itemsets Using FP-Growth Algorithm
However, ordering by decreasing support counts does not always lead to the smallest tree. For example,
suppose we augment the data set given in our example with 100 transactions that contain {e}, 80
transactions that contain {d}, 60 transactions that contain {c}, and 40 transactions that contain {b}.
Item e is now most frequent, followed by d, c, b, and a. With the augmented transactions, ordering by
decreasing support counts will result in an FP-tree similar to the figure in previous slide (i.e., 51), while a
scheme based on increasing support counts produces a smaller FP-tree similar to the figure in slide 49.
An FP-tree also contains a list of pointers connecting between nodes that have the same items. These
pointers, represented as dashed lines in the figures, help to facilitate the rapid access of individual items in
the tree.

Frequent Itemset Generation in FP-Growth Algorithm


FP-growth is an algorithm that generates frequent itemsets from an FP-tree by exploring the tree in a
bottom-up fashion.
In our example, the algorithm looks for frequent itemsets ending in e first, followed by d, c, b, and finally, a.
This bottom-up strategy for finding frequent itemsets ending with a particular item is known as the suffix-
based approach.
Mining Frequent Itemsets Using FP-Growth Algorithm
Since every transaction is mapped onto a
path in the FP-tree, we can derive the
frequent itemsets ending with a particular
item, say, e, by examining only the paths
containing node e. These paths can be
accessed rapidly using the pointers associated
with node e. The extracted paths are shown
in figure (a).
After finding the frequent itemsets ending in
e, the algorithm proceeds to look for frequent
itemsets ending in d by processing the paths
associated with node d. The corresponding
paths are shown in figure (b).
This process continues until all the paths
associated with nodes c, b, and finally a, are
processed. The paths for these items are
shown in figures (c), (d), and (e), respectively.
Mining Frequent Itemsets Using FP-Growth Algorithm

The corresponding frequent itemsets are summarized in the table below.

Table: The list of frequent itemsets ordered by their corresponding suffixes.


Suffix Frequent Itemsets
e {e}, {d,e}, {a,d,e}, {c,e},{a,e}
d {d}, {c,d}, {b,c,d}, {a,c,d}, {b,d}, {a,b,d}, {a,d}
c {c}, {b,c}, {a,b,c}, {a,c}
b {b}, {a,b}
a {a}
Mining Frequent Itemsets Using FP-Growth Algorithm

FP-growth finds all the frequent itemsets ending with a particular suffix by employing a divide-and-conquer
strategy to split the problem into smaller subproblems.

For example, suppose we are interested in finding all frequent itemsets ending in e. To do this
 We must first check whether the itemset {e} itself is frequent.
 If it is frequent, we consider the subproblem of finding frequent itemsets ending in de, followed by ce,
be, and ae.
 In turn, each of these subproblems are further decomposed into smaller subproblems.
 By merging the solutions obtained from the subproblems, all the frequent itemsets ending in e can be
found.

This divide-and-conquer approach is the key strategy employed by the FP-growth algorithm.
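The recursion itself can be sketched without the FP-tree machinery by projecting plain transaction lists onto each suffix item. The sketch below shows only the suffix-based divide-and-conquer structure, not the compressed tree that FP-growth actually traverses, and a fixed lexicographic order of items (assumed comparable, e.g., strings) stands in for the support-based order so that each itemset is generated exactly once.

from collections import Counter

def grow(transactions, suffix, min_count, results):
    counts = Counter(item for t in transactions for item in t)
    for item, cnt in counts.items():
        if cnt >= min_count:
            itemset = frozenset(suffix | {item})
            results[itemset] = cnt
            # conditional database for itemsets ending in `item` followed by the suffix
            projected = [[i for i in t if i < item] for t in transactions if item in t]
            grow(projected, itemset, min_count, results)

def pattern_growth(transactions, min_count):
    results = {}
    grow([sorted(set(t)) for t in transactions], frozenset(), min_count, results)
    return results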
Mining Frequent Itemsets Using FP-Growth Algorithm
On how to solve the subproblems, consider the task of finding frequent itemsets ending with e.
1. The first step is to gather all the paths containing node e. These initial paths are called prefix paths and
are shown in figure (a).
2. From the prefix paths shown in figure (a), the support count for e is obtained by adding the support
counts associated with node e. Assuming that the minimum support count is 2, {e} is declared a
frequent itemset because its support count is 3.
3. Because {e} is frequent, the algorithm has to solve the subproblems of finding frequent itemsets
ending in de, ce, be, and ae. Before solving these subproblems, it must first convert the prefix paths
into a conditional FP-tree, which is structurally similar to an FP-tree, except it is used to find frequent
itemsets ending with a particular suffix. A conditional FP-tree is obtained in the following way:
a) First, the support counts along the prefix paths must be updated because some of the counts
include transactions that do not contain item e. For example, the rightmost path shown in figure
(a), null → b:2 → c:2 → e:1, includes a transaction {b, c} that does not contain item e. The counts
along the prefix path must therefore be adjusted to 1 to reflect the actual number of transactions
containing {b, c, e}.
b) The prefix paths are truncated by removing the nodes for e. These nodes can be removed because
the support counts along the prefix paths have been updated to reflect only transactions that
contain e and the subproblems of finding frequent itemsets ending in de, ce, be, and ae no longer
need information about node e.
Mining Frequent Itemsets Using FP-Growth Algorithm
c) After updating the support counts along the prefix paths, some of the items may no longer be
frequent. For example, the node b appears only once and has a support count equal to 1, which
means that there is only one transaction that contains both b and e. Item b can be safely ignored
from subsequent analysis because all itemsets ending in be must be infrequent.
The conditional FP-tree for e is shown in Figure (b). The tree looks different than the original prefix
paths because the frequency counts have been updated and the nodes b and e have been eliminated.
4. FP-growth uses the conditional FP-tree for e to solve the subproblems of finding frequent itemsets
ending in de, ce, and ae. To find the frequent itemsets ending in de, the prefix paths for d are gathered
from the conditional FP-tree for e (figure (c)). By adding the frequency counts associated with node d,
we obtain the support count for {d, e}. Since the support count is equal to 2, {d, e} is declared a
frequent itemset. Next, the algorithm constructs the conditional FP-tree for de using the approach
described in step 3. After updating the support counts and removing the infrequent item c, the
conditional FP-tree for de is shown in figure (d). Since the conditional FP-tree contains only one item, a,
whose support is equal to minsup, the algorithm extracts the frequent itemset {a, d, e} and moves on
to the next subproblem, which is to generate frequent itemsets ending in ce. After processing the
prefix paths for c, only {c, e} is found to be frequent. The algorithm then proceeds to solve the next subproblem and finds {a, e} to be the only remaining frequent itemset.
Mining Frequent Itemsets Using FP-Growth Algorithm

At each recursive step, a conditional FP-tree is constructed by updating the frequency counts along the
prefix paths and removing all infrequent items. Because the subproblems are disjoint, FP-growth will not
generate any duplicate itemsets. In addition, the counts associated with the nodes allow the algorithm to
perform support counting while generating the common suffix itemsets.
FP-growth illustrates how a compact representation of the transaction data set helps to efficiently generate
frequent itemsets. In addition, for certain transaction data sets, FP-growth outperforms the standard
Apriori algorithm by several orders of magnitude. The run-time performance of FP-growth depends on the
compaction factor of the data set. If the resulting conditional FP-trees are very bushy (in the worst case, a
full prefix tree), then the performance of the algorithm degrades significantly because it has to generate a
large number of subproblems and merge the results returned by each subproblem.
Evaluation of Association Patterns
Association analysis algorithms have the potential to generate a large number of patterns. As the size and
dimensionality of real commercial databases can be very large, we can easily end up with thousands or
even millions of patterns, many of which might not be interesting.
However, identifying the most interesting patterns is not a trivial task because “one person’s trash might be
another person’s treasure”. Therefore, it is important to establish a set of well-accepted criteria for
evaluating the quality of association patterns.
The first set of criteria can be established through statistical arguments. Patterns that involve a set of
mutually independent items or cover very few transactions are considered uninteresting because they may
capture spurious relationships in the data. Such patterns can be eliminated by applying an objective
interestingness measure that uses statistics derived from data to determine whether a pattern is
interesting. Examples of objective interestingness measures include support, confidence, and correlation.
The second set of criteria can be established through subjective arguments. A pattern is considered
subjectively uninteresting unless it reveals unexpected information about the data or provides useful
knowledge that can lead to profitable actions. For example, the rule {Butter} → {Bread} may not be
interesting, despite having high support and confidence values, because the relationship represented by
the rule may seem rather obvious. On the other hand, the rule {Diapers} → {Beer} is interesting because the
relationship is quite unexpected and may suggest a new cross-selling opportunity for retailers.
However, incorporating subjective knowledge into pattern evaluation is a difficult task because it requires a
considerable amount of prior information from the domain experts.
Some approaches for incorporating subjective knowledge into the pattern discovery task are described below.
Visualization: This approach requires a user-friendly environment to keep the human user in the loop. It
also allows the domain experts to interact with the data mining system by interpreting and verifying the
discovered patterns.
Template-based approach: This approach allows the users to constrain the type of patterns extracted by
the mining algorithm. Instead of reporting all the extracted rules, only rules that satisfy a user-specified
template are returned to the users.
Subjective interestingness measure: A subjective measure can be defined based on domain information
such as concept hierarchy or profit margin of items. Such measures can then be used to filter patterns that
are obvious and non-actionable.
Objective Measures of Interestingness
An objective measure is a data-driven approach for evaluating the quality of association patterns. It is
domain-independent and requires minimal input from the users, other than a threshold for filtering
low-quality patterns. An objective measure is usually computed based on the frequency counts tabulated
in a contingency table.
The following table shows an example of a contingency table for a pair of binary variables, A and B. The
symbols Ā and B̄ indicate that A and B, respectively, are absent from a transaction.

           B      B̄
   A      f11    f10    f1+
   Ā      f01    f00    f0+
          f+1    f+0     N

Each entry fij in this table denotes a frequency count. For example, f11 is the number of times A and B
appear together in the same transaction, while f01 is the number of transactions that contain B but not A.
The row sum f1+ represents the support count for A, while the column sum f+1 represents the support
count for B.
The contingency tables are also applicable to other attribute types such as
symmetric binary, nominal, and ordinal variables.
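For concreteness, here is a minimal sketch of how such a table can be tabulated from transaction data; the function name and the toy transactions are illustrative assumptions, not part of any particular library.

def contingency_table(transactions, a, b):
    """Tabulate the 2x2 contingency counts (f11, f10, f01, f00) for items a and b."""
    f11 = f10 = f01 = f00 = 0
    for t in transactions:
        if a in t and b in t:
            f11 += 1
        elif a in t:
            f10 += 1
        elif b in t:
            f01 += 1
        else:
            f00 += 1
    return f11, f10, f01, f00

# Toy transactions for illustration:
baskets = [{'x', 'y'}, {'x', 'y', 'z'}, {'y', 'z'}, {'x'}, {'x', 'y'}]
f11, f10, f01, f00 = contingency_table(baskets, 'x', 'y')
N = f11 + f10 + f01 + f00
print(f11, f11 + f10, f11 + f01, N)   # f11, row sum f1+, column sum f+1, and N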
Limitations of the Support-Confidence Framework
The existing association rule mining formulation relies on the support and confidence measures to eliminate
uninteresting patterns. We have already discussed the drawback of support; the drawback of confidence is
more subtle.
Example: Suppose we are interested in analyzing the relationship between people who drink tea and
coffee. The gathered information is summarized in the following contingency table. Let us evaluate the
association rule {Tea} → {Coffee}.

            Coffee   No Coffee   Total
  Tea         150        50        200
  No Tea      650       150        800
  Total       800       200       1000
At first glance, it may appear that people who drink tea also tend to drink coffee because the rule’s support
(15%) and confidence (75%) values are reasonably high.
This argument would have been acceptable except that the fraction of people who drink coffee, regardless
of whether they drink tea, is 80%, while the fraction of tea drinkers who drink coffee is only 75%.
Thus, knowing that a person is a tea drinker actually decreases her probability of being a coffee drinker
from 80% to 75%! The rule {Tea} → {Coffee} is therefore misleading despite its high confidence value.
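The numbers behind this comparison can be checked directly from the contingency counts; a minimal sketch:

f11, f10, f01, f00 = 150, 50, 650, 150        # counts from the tea-coffee table
N = f11 + f10 + f01 + f00                     # 1000
support    = f11 / N                          # s(Tea, Coffee) = 0.15
confidence = f11 / (f11 + f10)                # c(Tea -> Coffee) = 0.75
p_coffee   = (f11 + f01) / N                  # overall fraction of coffee drinkers = 0.80
print(support, confidence, p_coffee)          # confidence (75%) < baseline (80%)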
The pitfall of confidence is that it ignores the support of the itemset in the rule consequent.
In the previous example, if the support of coffee drinkers is taken into account, we would not be surprised
to find that many of the people who drink tea also drink coffee. More surprisingly, the fraction of tea
drinkers who drink coffee is actually less than the overall fraction of people who drink coffee, which points
to an inverse relationship between tea drinkers and coffee drinkers.
Because of the limitations in the support-confidence framework, various objective measures have been
used to evaluate the quality of association patterns.
To tackle this weakness, a correlation measure can be used to augment the support–confidence framework
for association rules. This leads to correlation rules of the form
A → B [support, confidence, correlation]
That is, a correlation rule is measured not only by its support and confidence but also by the correlation
between itemsets A and B.
Lift is a simple correlation measure defined as follows. The occurrence of itemset A is independent of the
occurrence of itemset B if P(A ∪ B) = P(A)P(B); otherwise, itemsets A and B are dependent and correlated
as events. This definition can easily be extended to more than two itemsets. The lift between the
occurrence of A and B can be measured by computing
Lift(A, B) = P(A ∪ B) / (P(A) P(B))
If the resulting value of the equation is < 1, then the occurrence of A is negatively correlated with the
occurrence of B, meaning that the occurrence of one likely leads to the absence of the other one.
If the resulting value is > 1, then A and B are positively correlated, meaning that the occurrence of one
implies the occurrence of the other.
If the resulting value is equal to 1, then A and B are independent and there is no correlation between them.

The above equation is equivalent to P(B|A)/P(B), or conf(A → B)/sup(B), which is also referred to as the lift
of the association (or correlation) rule A → B.
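For instance, applying this to the tea-coffee data above: Lift(Tea, Coffee) = P(Tea ∪ Coffee) / (P(Tea) P(Coffee)) = 0.15 / (0.20 × 0.80) ≈ 0.94 < 1, which confirms the slight negative correlation between tea drinking and coffee drinking observed earlier.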
Another important correlation measure is the chi-square (χ2) statistic.
Interest Factor: For binary variables, lift is equivalent to another objective measure called interest factor,
which is defined as follows:
I(A,B) = s(A,B) / (s(A) × s(B)) = N f11 / (f1+ f+1)
Interest factor compares the frequency of a pattern against a baseline frequency computed under the
statistical independence assumption. The baseline frequency for a pair of mutually independent variables is
f11 / N = (f1+ / N) × (f+1 / N), or equivalently, f11 = (f1+ × f+1) / N
This equation follows from the standard approach of using simple fractions as estimates for probabilities.
The fraction f11/N is an estimate for the joint probability P(A,B), while f1+/N and f+1/N are the estimates for
P(A) and P(B), respectively. If A and B are statistically independent, then P(A,B) = P(A) × P(B), thus leading to
the formula shown above. Using the above equations, we can interpret the measure as follows:
I(A,B) = 1, if A and B are independent;
> 1, if A and B are positively correlated;
< 1, if A and B are negatively correlated.
Correlation Analysis: Correlation analysis is a statistical-based technique for analyzing relationships
between a pair of variables. For continuous variables, correlation is defined using Pearson’s correlation
coefficient. For binary variables, correlation can be measured using the φ-coefficient, which is defined as

φ = (f11 f00 − f01 f10) / √(f1+ f+1 f0+ f+0)

The value of correlation ranges from −1 (perfect negative correlation) to +1 (perfect positive correlation).
If the variables are statistically independent, then φ = 0.
IS Measure: IS is an alternative measure that has been proposed for handling asymmetric binary variables.
The measure is defined as follows:
IS(A,B) = √(I(A,B) × s(A,B)) = s(A,B) / √(s(A) s(B))
IS is large when the interest factor and support of the pattern are large.
It is possible to show that IS is mathematically equivalent to the cosine measure for binary variables. In this
regard, we consider A and B as a pair of bit vectors, with A • B = s(A,B) the dot product between the vectors
and |A| = √s(A) the magnitude of vector A (taking s(·) as a support count). Therefore:

IS(A,B) = s(A,B) / √(s(A) s(B)) = (A • B) / (|A| × |B|) = cosine(A,B)
The IS measure can also be expressed as the geometric mean between the confidence of association rules
extracted from a pair of binary variables:
IS(A,B) = √( (s(A,B) / s(A)) × (s(A,B) / s(B)) ) = √( c(A → B) × c(B → A) )
Because the geometric mean between any two numbers is always closer to the smaller number, the IS
value of an itemset {p, q} is low whenever one of its rules, p → q or q → p, has low confidence.
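To make these definitions concrete, here is a minimal sketch that computes the interest factor, the φ-coefficient, and the IS measure from the counts of a 2x2 contingency table; the function name is illustrative, and the numbers in the example are the tea-coffee counts used earlier.

from math import sqrt

def objective_measures(f11, f10, f01, f00):
    """Interest factor, phi-coefficient, and IS measure from 2x2 contingency counts."""
    N = f11 + f10 + f01 + f00
    f1p, fp1 = f11 + f10, f11 + f01          # row sum f1+ and column sum f+1
    f0p, fp0 = f01 + f00, f10 + f00          # f0+ and f+0
    interest = N * f11 / (f1p * fp1)         # I(A,B) = N f11 / (f1+ f+1)
    phi = (f11 * f00 - f01 * f10) / sqrt(f1p * fp1 * f0p * fp0)
    s_ab, s_a, s_b = f11 / N, f1p / N, fp1 / N
    is_measure = s_ab / sqrt(s_a * s_b)      # equals cosine(A, B) for binary vectors
    return interest, phi, is_measure

# Tea-coffee counts from the earlier example:
print(objective_measures(150, 50, 650, 150))   # (0.9375, -0.0625, 0.375)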
Effect of Skewed Support Distribution
The performance of many association analysis algorithms is influenced by properties of their input data.
For example, the computational complexity of the Apriori algorithm depends on properties such as the
number of items in the data and average transaction width.
The skewed support distribution in the data set (where most of the items have relatively low to moderate
frequencies, but a small number of them have very high frequencies) also has significant influence on the
performance of association analysis algorithms as well as the quality of extracted patterns.
An example of such a real data set is shown in the adjacent
figure. This data, taken from the PUMS (Public Use Microdata
Sample) census data, contains 49,046 records and 2113
asymmetric binary variables.
We shall treat the asymmetric binary variables as items and
records as transactions. While more than 80% of the items have
support less than 1%, a handful of them have support greater
than 90%.
Consider the grouping of the items in the census data set, based on their support values, as shown in the
following table.

  Group              G1        G2          G3
  Support            < 1%      1% - 90%    > 90%
  Number of Items    1735      358         20
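One simple way to check whether a transaction data set exhibits this kind of skew is to compute the support of every item and bucket the items using the same cut-offs as the table above; a minimal sketch, where the toy transaction list stands in for the actual census records.

from collections import Counter

def support_groups(transactions, low=0.01, high=0.90):
    """Count how many items fall into the low (G1), moderate (G2), and high (G3)
    support ranges used in the table above."""
    N = len(transactions)
    counts = Counter(item for t in transactions for item in set(t))
    groups = {'G1': 0, 'G2': 0, 'G3': 0}
    for item, c in counts.items():
        s = c / N
        if s < low:
            groups['G1'] += 1
        elif s > high:
            groups['G3'] += 1
        else:
            groups['G2'] += 1
    return groups

# Toy stand-in for the census records:
print(support_groups([{'a', 'b'}, {'a'}, {'a', 'c'}, {'a', 'b', 'c'}]))  # {'G1': 0, 'G2': 2, 'G3': 1}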
Choosing the right support threshold for mining this data set (i.e., skewed data sets) can be quite tricky. If we set the
threshold too high (e.g., 20%), then we may miss many interesting patterns involving the low support items from G1.
In market basket analysis, such low support items may correspond to expensive products (such as jewelry) that
are seldom bought by customers, but whose patterns are still interesting to retailers.
Conversely, when the threshold is set too low, it becomes difficult to find the association patterns due to the following
reasons.
 First, the computational and memory requirements of existing association analysis algorithms increase
considerably with low support thresholds.
 Second, the number of extracted patterns also increases substantially with low support thresholds.
 Third, we may extract many spurious patterns that relate a high-frequency item such as milk to a low-frequency
item such as caviar. Such patterns, which are called cross-support patterns, are likely to be spurious because their
correlations tend to be weak.
For example, at a support threshold equal to 0.05%, there are 18,847 frequent pairs involving items from G1
and G3. Out of these, 93% of them are cross-support patterns; i.e., the patterns contain items from both G1
and G3. The maximum correlation obtained from the cross-support patterns is 0.029, which is much lower
than the maximum correlation obtained from frequent patterns involving items from the same group (which
is as high as 1.0).
This example shows that a large number of weakly correlated cross-support patterns can be generated when the
support threshold is sufficiently low.
Definition (Cross-Support Pattern): A cross-support pattern is an itemset X = {i1, i2, . . . , ik} whose support
ratio

r(X) = min[s(i1), s(i2), . . . , s(ik)] / max[s(i1), s(i2), . . . , s(ik)]

is less than a user-specified threshold hc.
Example: Suppose the support for milk is 70%, while the support for sugar is 10% and caviar is 0.04%. Given
hc = 0.01, the frequent itemset {milk, sugar, caviar} is a cross-support pattern because its support ratio is

r = min[0.7, 0.1, 0.0004] / max[0.7, 0.1, 0.0004] = 0.0004 / 0.7 ≈ 0.00057 < 0.01.
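The support ratio is simple to compute once the individual item supports are known; a minimal sketch using the same numbers as the example above:

def support_ratio(item_supports):
    """r(X) = min item support / max item support for an itemset X."""
    return min(item_supports) / max(item_supports)

def is_cross_support(item_supports, hc):
    """An itemset is a cross-support pattern if its support ratio is below hc."""
    return support_ratio(item_supports) < hc

# {milk, sugar, caviar} with supports of 70%, 10%, and 0.04%:
print(support_ratio([0.7, 0.1, 0.0004]))           # roughly 0.0006
print(is_cross_support([0.7, 0.1, 0.0004], 0.01))  # True: a cross-support pattern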
Existing measures such as support and confidence may not be sufficient
to eliminate cross-support patterns.
In the adjacent figure, assuming hc = 0.3, the itemsets {p, q}, {p, r}, and
{p, q, r} are cross-support patterns because their support ratios, which
are equal to 0.2, are less than the threshold hc.
If we apply a high support threshold, say, 20%, to eliminate the cross-
support patterns, it comes at the expense of discarding other interesting
patterns such as the strongly correlated itemset {q, r}, which has a
support equal to 16.7%.
Confidence pruning also does not help because the confidence of the
rules extracted from cross-support patterns can be very high.
For example, the confidence for {q} → {p} is 80% even though {p, q}
is a cross-support pattern.
The fact that the cross-support pattern can produce a high-confidence
rule should not come as a surprise because one of its items (p) appears
very frequently in the data. Therefore, p is expected to appear in many of the transactions that contain q.

Figure: A transaction data set containing three items, p, q, and r, where p is a high support item and q and r
are low support items.
The rule {q} → {r} also has high confidence even though {q, r} is not a
cross-support pattern.
This example demonstrates the difficulty of using the confidence
measure to distinguish between rules extracted from cross-support
and non-cross-support patterns.
Returning to the example, we notice that the rule {p} → {q} has very
low confidence because most of the transactions that contain p do
not contain q. In contrast, the rule {r} → {q}, which is derived from
the pattern {q, r}, has very high confidence.
This observation suggests that cross-support patterns can be
detected by examining the lowest confidence rule that can be
extracted from a given itemset.
1. Recall the following anti-monotone property of confidence:
conf({i1, i2} → {i3, i4, . . . , ik}) ≤ conf({i1, i2, i3} → {i4, i5, . . . , ik})
This property suggests that confidence never increases as we shift more items from the left- to the
right-hand side of an association rule. Because of this property, the lowest confidence rule extracted
from a frequent itemset contains only one item on its left-hand side. We denote the set of all rules with
only one item on its left-hand side as R1.
2. Given a frequent itemset {i1, i2, . . . , ik}, the rule
{ij} → {i1, i2, . . . , ij−1, ij+1, . . . , ik}
has the lowest confidence in R1 if s(ij) = max[s(i1), s(i2), . . . , s(ik)]. This follows directly from the
definition of confidence as the ratio between the rule’s support and the support of the rule antecedent.
3. Summarizing the previous points, the lowest confidence attainable from a frequent itemset {i1, i2, . . . ,
ik} is
s({i1, i2, . . . , ik}) / max[s(i1), s(i2), . . . , s(ik)]
This expression is also known as the h-confidence or all-confidence measure.
Because of the anti-monotone property of support, the numerator of the h-confidence measure is bounded by
the minimum support of any item that appears in the frequent itemset. In other words, the h-confidence of an
itemset X = {i1, i2, . . . , ik} must not exceed the following expression:
h-confidence(X) ≤ min[s(i1), s(i2), . . . , s(ik)] / max[s(i1), s(i2), . . . , s(ik)]
Note the equivalence between the upper bound of h-confidence and the support ratio (r) in the definition (Cross-
Support Pattern). Because the support ratio for a cross-support pattern is always less than hc, the h-confidence of
the pattern is also guaranteed to be less than hc.
Therefore, cross-support patterns can be eliminated by ensuring that the h-confidence values for the patterns
exceed hc.
The h-confidence measure is also anti-monotone, i.e., h-confidence({i1, i2, . . . , ik}) ≥ h-confidence({i1, i2, . . . ,
ik+1}), and thus can be incorporated directly into the mining algorithm.
The h-confidence also ensures that the items contained in an itemset are strongly associated with each other.
For example, suppose the h-confidence of an itemset X is 80%. If one of the items in X is present in a
transaction, there is at least an 80% chance that the rest of the items in X also belong to the same
transaction. Such strongly associated patterns are called hyperclique patterns.
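A minimal sketch of how h-confidence can be used as a filter, assuming the itemset support and the individual item supports are already available from the mining algorithm; the numbers in the usage example are hypothetical.

def h_confidence(itemset_support, item_supports):
    """h-confidence of X = s(X) / max[s(i1), ..., s(ik)]."""
    return itemset_support / max(item_supports)

def h_confidence_bound(item_supports):
    """Upper bound on h-confidence: the support ratio min/max of the item supports."""
    return min(item_supports) / max(item_supports)

def is_hyperclique(itemset_support, item_supports, hc):
    """Keep only itemsets whose h-confidence reaches hc; this automatically
    eliminates cross-support patterns, whose h-confidence is below hc."""
    return h_confidence(itemset_support, item_supports) >= hc

# Hypothetical itemset with support 0.16 whose items have supports 0.20 and 0.25:
print(h_confidence(0.16, [0.20, 0.25]))         # 0.64
print(h_confidence_bound([0.20, 0.25]))         # 0.8
print(is_hyperclique(0.16, [0.20, 0.25], 0.3))  # True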