Unit IV DWDM
Association rule mining is a technique in data mining that aims to discover interesting relationships or
patterns within large datasets. It is particularly useful for uncovering associations between different
variables or items in transactional databases, such as items frequently purchased together in a retail
setting or symptoms co-occurring in medical records.
Here are some key concepts and steps involved in association rule mining:
1. Transaction Database:
• Association rule mining typically starts with a transactional database, where each
transaction represents a set of items. For example, a transaction could be a customer's
shopping basket containing various products.
2. Support:
• The support of an itemset is the proportion (or count) of transactions in the database that contain the itemset. It measures how frequently the itemset appears in the data.
3. Confidence:
• The confidence of a rule "IF {antecedent} THEN {consequent}" is the proportion of transactions containing the antecedent that also contain the consequent. It measures how reliable the rule is.
4. Association Rules:
• Association rules are typically expressed in the form "IF {antecedent} THEN
{consequent}." For example, "IF {diapers} THEN {baby formula}." These rules are derived
based on the support and confidence thresholds set by the user.
5. Apriori Algorithm:
• The Apriori algorithm is a popular algorithm for association rule mining. It uses a
breadth-first search strategy to discover frequent itemsets and generate association
rules efficiently. The algorithm relies on the Apriori property, which states that if an
itemset is frequent, all of its subsets must also be frequent.
6. Pruning:
• To optimize the process of rule discovery, pruning techniques are often employed to
eliminate itemsets or rules that do not meet certain criteria, such as minimum support
or confidence thresholds.
7. Lift:
• Lift is another measure used in association rule mining. It compares the likelihood of the
consequent occurring when the antecedent is present to the likelihood of the
consequent occurring in general. A lift value greater than 1 indicates a positive
correlation between the antecedent and consequent.
Association rule mining is widely used in various fields, including retail, healthcare, and finance, to
uncover hidden patterns in large datasets. However, it's essential to interpret the results carefully, as
associations do not imply causation, and some discovered rules may be spurious or coincidental.
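To make the support, confidence, and lift measures concrete, here is a minimal Python sketch; the small transaction list and the diapers/baby-formula rule are illustrative assumptions, not data from the text:

# Minimal sketch: computing support, confidence, and lift for one rule.
# The transactions below are illustrative only.
transactions = [
    {"diapers", "baby formula", "milk"},
    {"diapers", "baby formula"},
    {"milk", "bread"},
    {"diapers", "milk"},
    {"baby formula", "bread"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent together) / support(antecedent)."""
    return support(set(antecedent) | set(consequent), transactions) / support(antecedent, transactions)

def lift(antecedent, consequent, transactions):
    """confidence(antecedent -> consequent) / support(consequent)."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)

rule = ({"diapers"}, {"baby formula"})
print("support   :", support(rule[0] | rule[1], transactions))   # 0.4
print("confidence:", confidence(*rule, transactions))            # 0.666...
print("lift      :", lift(*rule, transactions))                  # 1.11... (> 1, positive correlation)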
Frequent itemset generation is a crucial step in association rule mining, particularly in algorithms like
Apriori. The goal is to identify sets of items that occur together frequently in a dataset. Here's an
overview of the process:
1. Support Count:
• The support count of an itemset is the number of transactions in which the itemset
appears. It is the basis for determining the frequency of itemsets. For example, if you're
working with a retail dataset, the support count of an itemset {A, B} is the number of
transactions containing both items A and B.
2. Support Threshold:
• The support threshold is a user-defined parameter that sets the minimum support count
or percentage required for an itemset to be considered "frequent." Itemsets that meet
or exceed this threshold are considered candidates for further analysis.
3. Apriori Algorithm:
• The Apriori algorithm is a classic algorithm for generating frequent itemsets. It uses a
level-wise, breadth-first search strategy to discover frequent itemsets of increasing size.
The algorithm relies on the Apriori property, which states that if an itemset is frequent,
all of its subsets must also be frequent.
4. Algorithm Steps:
• Candidate Generation: Generate candidate k-itemsets by joining the frequent (k-1)-itemsets found in the previous pass.
• Support Counting: Scan the dataset to count the support of each candidate itemset.
• Pruning: Eliminate candidate itemsets that do not meet the minimum support threshold.
• Repeat: Repeat the process for larger itemset sizes until no new frequent itemsets can be found.
5. Example:
• T1: {A, B, C}
• T2: {A, B}
• T3: {A, C}
• T4: {B, C}
• T5: {B}
• With a minimum support threshold of 2, the initial frequent 1-itemsets are {A}, {B}, and
{C}. Then, the algorithm iteratively generates and prunes candidate itemsets of higher
sizes until no more frequent itemsets can be found.
6. Performance Optimization:
• To improve efficiency, the Apriori algorithm often uses techniques such as pruning
(eliminating candidates with infrequent subsets) and the hash tree structure for
counting support.
Frequent itemset generation is a fundamental step in association rule mining, providing the basis for
discovering meaningful patterns and relationships in large datasets.
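To make the level-wise process concrete, the following minimal Python sketch (a simplified take on Apriori, not an optimized implementation) generates the frequent itemsets for the five example transactions above with a minimum support count of 2:

from itertools import combinations

# The five example transactions above, with a minimum support count of 2.
transactions = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}, {"B"}]
min_support_count = 2

def support_count(itemset, transactions):
    return sum(itemset <= t for t in transactions)

items = sorted(set().union(*transactions))
frequent = {}                                   # frozenset -> support count
level = [frozenset([i]) for i in items]         # candidate 1-itemsets
while level:
    # Count candidates and keep only those meeting the support threshold.
    counts = {c: support_count(c, transactions) for c in level}
    current = {c: n for c, n in counts.items() if n >= min_support_count}
    frequent.update(current)
    # Join step: build (k+1)-item candidates from the frequent k-itemsets.
    keys = list(current)
    candidates = {a | b for a, b in combinations(keys, 2) if len(a | b) == len(a) + 1}
    # Prune step (Apriori principle): every k-subset must itself be frequent.
    level = [c for c in candidates
             if all(frozenset(s) in current for s in combinations(c, len(c) - 1))]

for itemset in sorted(frequent, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset), frequent[itemset])

Running this prints the frequent 1-itemsets {A}, {B}, {C} and the frequent 2-itemsets {A, B}, {A, C}, {B, C}; the candidate {A, B, C} appears in only one transaction and is therefore discarded.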
The Apriori principle is based on the observation that if an itemset is frequent, then all of its subsets
must also be frequent. This principle is often expressed as:
If an itemset is frequent, then every non-empty subset of that itemset is also frequent.
Conversely:
If an itemset is infrequent, then every superset of that itemset is also infrequent.
This principle is crucial for efficiently identifying frequent itemsets in a large dataset without having to
examine all possible combinations. The Apriori algorithm uses the Apriori principle to generate
candidate itemsets and prune those that cannot be frequent based on the downward closure property.
Here's how the Apriori principle is applied in the context of the Apriori algorithm:
1. Find Frequent 1-Itemsets:
• Initially, the algorithm identifies frequent 1-itemsets (individual items) by counting their support in the dataset.
2. Generate Candidate Itemsets:
• For subsequent iterations, the algorithm generates candidate itemsets of size k based on frequent itemsets of size k-1.
3. Prune Candidate Itemsets:
• Before counting the support of candidate itemsets, the algorithm prunes candidates that have infrequent subsets. This pruning step is possible due to the Apriori principle, which ensures that if an itemset is infrequent, any of its supersets (larger itemsets) will also be infrequent.
4. Count Support and Repeat:
• After pruning, the algorithm counts the support of the remaining candidate itemsets in the dataset. Frequent itemsets are retained, and the process is repeated until no new frequent itemsets can be found.
By leveraging the Apriori principle and the downward closure property, the Apriori algorithm efficiently
explores the search space of potential frequent itemsets, avoiding the need to examine all possible
combinations and reducing the computational cost of association rule mining.
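As a rough illustration of the join-and-prune step just described, here is a minimal Python sketch; the itemset representation (sorted tuples) and the small L2 example at the bottom are assumptions made for illustration:

from itertools import combinations

def apriori_gen(frequent_k_minus_1):
    """Generate candidate k-itemsets from frequent (k-1)-itemsets.

    `frequent_k_minus_1` is a set of sorted tuples, each of length k-1.
    Join: two (k-1)-itemsets are merged when their first k-2 items agree.
    Prune: drop any candidate that has a (k-1)-subset which is not frequent.
    """
    candidates = set()
    for a, b in combinations(sorted(frequent_k_minus_1), 2):
        if a[:-1] == b[:-1]:                    # first k-2 items in common
            candidates.add(tuple(sorted(set(a) | set(b))))
    pruned = set()
    for c in candidates:
        subsets = combinations(c, len(c) - 1)   # all (k-1)-subsets of the candidate
        if all(s in frequent_k_minus_1 for s in subsets):
            pruned.add(c)
    return pruned

# Example: frequent 2-itemsets -> candidate 3-itemsets.
L2 = {("I1", "I2"), ("I1", "I3"), ("I2", "I3"), ("I2", "I4")}
print(apriori_gen(L2))   # {('I1', 'I2', 'I3')}; ('I2', 'I3', 'I4') is pruned
                         # because its subset ('I3', 'I4') is not in L2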
APRIORI ALGORITHM
The Apriori algorithm was proposed by R. Agrawal and R. Srikant in 1994 for finding frequent itemsets in a dataset
for Boolean association rules. The algorithm is named Apriori because it uses prior knowledge of
frequent itemset properties. It applies an iterative, level-wise search in which frequent k-itemsets
are used to find (k+1)-itemsets.
To improve the efficiency of level-wise generation of frequent itemsets, an important property is used
called Apriori property which helps by reducing the search space.
Apriori Property –
All non-empty subsets of a frequent itemset must also be frequent. The key concept of the Apriori algorithm is the
anti-monotonicity of the support measure: Apriori assumes that if an itemset is infrequent, then all of its supersets are infrequent as well.
Before working through the algorithm, recall the definitions of support, support count, and confidence given above.
Consider the following transaction dataset; we will find the frequent itemsets and generate association rules for
them.
Minimum support count is 2.
Minimum confidence is 60%.
Step-1: K=1
(I) Create a table containing the support count of each item present in the dataset; this is called C1 (the candidate set).
(II) Compare each candidate item's support count with the minimum support count (here min_support = 2); if the
support count of a candidate item is less than min_support, remove that item. This gives us the
itemset L1.
Step-2: K=2
• Generate candidate set C2 using L1 (this is called the join step). The condition for joining Lk-1 with Lk-1 is
that the itemsets must have (k-2) elements in common.
• Check whether all subsets of each candidate itemset are frequent; if not, remove that
itemset. (For example, the subsets of {I1, I2} are {I1} and {I2}, which are frequent. Check this for each itemset.)
• Now find the support count of these itemsets by searching the dataset.
• Compare each candidate's (C2) support count with the minimum support count (here min_support = 2); if the
support count of a candidate itemset is less than min_support, remove that itemset. This gives us the
itemset L2.
Step-3: K=3
• Generate candidate set C3 using L2 (join step). The condition for joining Lk-1 with Lk-1 is that the itemsets
must have (k-2) elements in common; so here, for L2, the first element should match.
The itemsets generated by joining L2 are {I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I4, I5}, and {I2, I3, I5}.
• Check whether all subsets of these itemsets are frequent; if not, remove that
itemset. (Here, the 2-item subsets of {I1, I2, I3} are {I1, I2}, {I2, I3}, and {I1, I3}, which are frequent. For {I2,
I3, I4}, the subset {I3, I4} is not frequent, so remove it. Similarly, check every itemset.)
• After pruning, the remaining candidates are {I1, I2, I3} and {I1, I2, I5}; their support counts meet the minimum support, so L3 = {{I1, I2, I3}, {I1, I2, I5}}.
Step-4: K=4
• Generate candidate set C4 using L3 (join step). The condition for joining Lk-1 with Lk-1 (here k = 4) is
that the itemsets must have (k-2) elements in common; so here, for L3, the first two items
should match.
• Check whether all subsets of these itemsets are frequent. (Here, the itemset formed by joining
L3 is {I1, I2, I3, I5}; one of its subsets is {I1, I3, I5}, which is not frequent.) So there is no itemset
in C4, and the algorithm stops.
Thus, we have discovered all the frequent itemsets. Now the generation of strong association rules comes
into the picture. For that we need to calculate the confidence of each rule.
Confidence –
A confidence of 60% means that 60% of the customers who purchased milk and bread also bought
butter.
Confidence(A -> B) = Support_count(A ∪ B) / Support_count(A)
So here, taking one of the frequent itemsets as an example, we will show the rule generation.
Itemset {I1, I2, I3} // from L3
The rules can be:
[I1^I2]=>[I3] // confidence = sup(I1^I2^I3)/sup(I1^I2) = 2/4 = 50%
[I1^I3]=>[I2] // confidence = sup(I1^I2^I3)/sup(I1^I3) = 2/4 = 50%
[I2^I3]=>[I1] // confidence = sup(I1^I2^I3)/sup(I2^I3) = 2/4 = 50%
[I1]=>[I2^I3] // confidence = sup(I1^I2^I3)/sup(I1) = 2/6 ≈ 33.3%
[I2]=>[I1^I3] // confidence = sup(I1^I2^I3)/sup(I2) = 2/7 ≈ 28.6%
[I3]=>[I1^I2] // confidence = sup(I1^I2^I3)/sup(I3) = 2/6 ≈ 33.3%
So if the minimum confidence is 50%, the first three rules can be considered strong association rules.
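The original transaction table for this example is not reproduced above, so the following Python sketch assumes the classic nine-transaction dataset whose support counts match the values used in the calculations (e.g. sup(I1) = 6, sup(I2) = 7, sup(I1, I2) = 4); treat the transaction list itself as an assumption. It reproduces the six rules and confidence values listed above.

from itertools import combinations

# Assumed transaction table (not shown in the text); its support counts match
# the values used above, e.g. sup(I1)=6, sup(I2)=7, sup(I1,I2)=4, sup(I1,I2,I3)=2.
transactions = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]

def sup(itemset):
    """Support count: number of transactions containing all items of `itemset`."""
    return sum(set(itemset) <= t for t in transactions)

# Rule generation for the frequent itemset {I1, I2, I3}: every non-empty
# proper subset becomes an antecedent, the remaining items the consequent.
itemset = ("I1", "I2", "I3")
min_confidence = 0.5          # threshold used in the rule listing above
for r in range(1, len(itemset)):
    for antecedent in combinations(itemset, r):
        consequent = tuple(i for i in itemset if i not in antecedent)
        conf = sup(itemset) / sup(antecedent)
        label = "strong" if conf >= min_confidence else "weak"
        print(f"{set(antecedent)} => {set(consequent)}: "
              f"confidence = {sup(itemset)}/{sup(antecedent)} = {conf:.1%} ({label})")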
Limitations of Apriori Algorithm
The Apriori algorithm can be slow. Its main limitation is the time and memory required to hold a vast number of candidate
sets when there are many frequent itemsets, a low minimum support threshold, or long itemsets; that is, it is not an efficient
approach for very large datasets. For example, if there are 10^4 frequent 1-itemsets, the algorithm needs
to generate more than 10^7 candidate 2-itemsets, which must then be tested and accumulated.
Furthermore, to detect a frequent pattern of size 100, e.g. {v1, v2, ..., v100}, it has to generate on the order of 2^100
candidate itemsets, which makes candidate generation costly and time-consuming. The algorithm must check
many candidate itemsets, and it must repeatedly scan the database to count their support. As a result,
Apriori becomes very slow and inefficient when memory capacity is limited and the number of transactions is large.
RULE GENERATION
Rule generation in the context of association rule mining, specifically using algorithms like Apriori,
involves deriving meaningful relationships or patterns from the discovered frequent itemsets. Once
frequent itemsets are identified, association rules are generated to express the associations between
items. These rules are in the form "IF {antecedent} THEN {consequent}" and provide insights into the co-
occurrence of items in the dataset.
1. Identify Frequent Itemsets:
• Before generating rules, you need to identify frequent itemsets using an algorithm such
as Apriori. Frequent itemsets are sets of items that occur together frequently in the
dataset.
2. Rule Generation:
• For each frequent itemset, generate candidate rules by splitting the itemset into a non-empty antecedent and a non-empty consequent (every non-empty proper subset can serve as an antecedent, with the remaining items as the consequent).
3. Rule Evaluation:
• Evaluate the quality of each rule using metrics such as support, confidence, and lift.
These metrics help determine the significance and reliability of the discovered
associations.
• Support: The proportion of transactions in the dataset that contain both the
antecedent and the consequent.
• Confidence: The proportion of transactions containing the antecedent that also
contain the consequent.
• Lift: Measures the degree to which the antecedent and consequent are
dependent, considering their individual occurrences.
4. Pruning Rules:
• Discard rules that do not meet the user-defined thresholds, such as the minimum
confidence, so that only significant rules are retained.
5. Rule Presentation:
• Present the generated rules in a human-readable format, making it easy for users to
understand the relationships between different items. This might involve sorting the
rules based on confidence or support.
6. Iterative Refinement:
• Depending on the specific goals of the analysis, you may need to iteratively refine the
rule generation process by adjusting parameters, such as support and confidence
thresholds, or by considering additional domain-specific constraints.
7. Rule Application:
• Once you have a set of high-quality rules, you can apply them to new data to make
predictions or gain insights. For example, in a retail setting, if you discover a rule like "IF
{bread} THEN {butter}," it suggests that customers who buy bread are likely to buy
butter as well.
It's important to note that association rules do not imply causation, and the interpretation of rules
should be done cautiously. Additionally, the effectiveness of rules depends on the quality of the data
and the appropriateness of the algorithm and parameters chosen for rule generation.
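As a sketch of steps 2 through 5 above (generation, evaluation, pruning, and presentation), the following Python fragment derives rules from a small table of frequent-itemset supports; the bread/butter support values are illustrative assumptions:

from itertools import combinations

# Illustrative frequent-itemset supports (as fractions of all transactions).
supports = {
    frozenset({"bread"}): 0.40,
    frozenset({"butter"}): 0.30,
    frozenset({"bread", "butter"}): 0.25,
}
min_confidence = 0.6

rules = []
for itemset in (s for s in supports if len(s) > 1):
    for r in range(1, len(itemset)):
        for ante in map(frozenset, combinations(itemset, r)):
            cons = itemset - ante
            confidence = supports[itemset] / supports[ante]      # evaluation
            lift = confidence / supports[cons]
            if confidence >= min_confidence:                     # pruning
                rules.append((set(ante), set(cons), supports[itemset], confidence, lift))

# Presentation: sort the surviving rules by confidence, highest first.
for ante, cons, sup_, conf, lift in sorted(rules, key=lambda x: -x[3]):
    print(f"IF {ante} THEN {cons}  support={sup_:.2f} confidence={conf:.2f} lift={lift:.2f}")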
COMPACT REPRESENTATION OF FREQUENT ITEMSETS
Because the number of frequent itemsets can be very large, several compact representations are used to store them more efficiently:
1. Closed Itemsets:
• Closed itemsets are frequent itemsets that have no superset with the same support count. They provide a compact yet lossless representation, since the exact support of every frequent itemset can be recovered from the closed itemsets.
2. Maximal Itemsets:
• Maximal itemsets are frequent itemsets that are not subsets of any other frequent
itemset. Unlike closed itemsets, maximal itemsets do not preserve the support counts of
their subsets; they only capture the boundary of the frequent itemsets. Maximal itemsets
therefore provide an even more compact, but lossy, representation by excluding subsets
that do not add new structural information.
3. Association Rules:
• Instead of storing all frequent itemsets separately, one can store only the high-
confidence association rules. These rules capture the essential relationships between
items in a more human-readable and actionable format. The compactness comes from
representing associations rather than individual itemsets.
4. Tree-Based Structures:
• Some methods use tree-based structures, such as FP-growth (Frequent Pattern growth),
to compactly represent frequent itemsets. FP-growth builds a compressed data
structure called the FP-tree, which facilitates efficient mining of frequent itemsets
without the need to explicitly generate and store all possible itemsets.
5. Bitwise Representations:
• In databases where items have unique identifiers, bitwise representations can be used
to compactly represent itemsets. Each item corresponds to a bit, and itemsets are
represented as bit vectors, making it computationally efficient for certain operations.
6. Vertical Data Format:
• In some cases, representing the data in a vertical format, where each item has its own list of
transaction IDs, can lead to a more compact representation, especially when dealing with
sparse datasets.
The choice of compact representation depends on the specific requirements of the analysis, the
characteristics of the dataset, and the goals of the mining process. Each compact representation method
has its advantages and trade-offs in terms of storage efficiency, computational complexity, and ease of
interpretation.
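To illustrate the difference between closed and maximal itemsets described above, here is a minimal Python sketch over an assumed table of frequent itemsets and their support counts:

# Illustrative frequent itemsets with their support counts.
frequent = {
    frozenset({"A"}): 4, frozenset({"B"}): 5, frozenset({"C"}): 3,
    frozenset({"A", "B"}): 4, frozenset({"A", "C"}): 2, frozenset({"B", "C"}): 3,
    frozenset({"A", "B", "C"}): 2,
}

# Closed: no proper superset has the same support count.
closed = {s for s, c in frequent.items()
          if not any(s < t and frequent[t] == c for t in frequent)}

# Maximal: no proper superset is frequent at all.
maximal = {s for s in frequent
           if not any(s < t for t in frequent)}

print("closed :", [set(s) for s in closed])    # {B}, {A,B}, {B,C}, {A,B,C}
print("maximal:", [set(s) for s in maximal])   # {A,B,C} only

Every maximal itemset is also closed, but the closed representation additionally keeps itemsets whose support differs from all of their supersets, which is what makes it lossless.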
The FP-Growth (Frequent Pattern Growth) algorithm is an efficient algorithm for mining frequent
itemsets from transactional databases. It was introduced by Jiawei Han, Jian Pei, and Yiwen Yin in their
paper "Mining Frequent Patterns without Candidate Generation" in 2000. FP-Growth is particularly well-
suited for large datasets and is an alternative to the Apriori algorithm.
The key steps of the FP-Growth algorithm are:
1. Construct the FP-Tree:
• Scan the transactional database and construct a data structure called the FP-Tree
(Frequent Pattern Tree).
• The FP-Tree is built by inserting each transaction into the tree. Items within a
transaction are added as nodes, and the tree structure represents the relationships
between different items.
2. Build Conditional Pattern Bases:
• For each frequent item in the dataset, create a conditional pattern base by removing the
frequent item from the original transactions and keeping the remaining structure. This
step is performed recursively.
3. Build Conditional FP-Trees:
• For each conditional pattern base, build a conditional FP-Tree. This is essentially a
smaller FP-Tree constructed from the conditional pattern base.
4. Mine Recursively:
• Recursively mine frequent itemsets from each conditional FP-Tree. This process involves
repeating the steps of building conditional pattern bases and constructing conditional
FP-Trees until no more frequent itemsets can be found.
5. Combine Results:
• Combine the frequent itemsets obtained from the conditional FP-Trees with the
frequent itemsets from the original transactions to obtain the complete set of frequent
itemsets.
The main advantages of FP-Growth are:
• No Candidate Generation:
• Unlike the Apriori algorithm, FP-Growth does not generate candidate itemsets explicitly.
It constructs the FP-Tree directly from the dataset, avoiding the need to generate and
test multiple candidate itemsets.
• Efficiency:
• FP-Growth can be more efficient than traditional algorithms, especially when dealing
with large datasets, as it reduces the number of passes over the data and avoids the
generation of an explicit candidate set.
• Compact Representation:
• The FP-Tree is a compact data structure that captures the frequency information in a
condensed form, making it efficient for frequent pattern mining.
While FP-Growth is generally efficient, its performance depends on the characteristics of the dataset. It
is well-suited for datasets with a large number of transactions and a relatively small number of unique
items.
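Where a quick application is all that is needed, FP-Growth is also available in off-the-shelf libraries. The sketch below assumes the third-party mlxtend library (and pandas) is installed; its exact behavior may vary between versions, and the small transaction list is illustrative:

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

# Illustrative transactions; each inner list is one basket.
transactions = [
    ["bread", "milk", "butter"],
    ["bread", "butter"],
    ["milk", "butter"],
    ["bread", "milk"],
    ["bread", "milk", "butter"],
]

# One-hot encode the transactions into a boolean DataFrame.
te = TransactionEncoder()
onehot = te.fit(transactions).transform(transactions)
df = pd.DataFrame(onehot, columns=te.columns_)

# Mine frequent itemsets with a minimum support of 40% using FP-Growth.
frequent_itemsets = fpgrowth(df, min_support=0.4, use_colnames=True)
print(frequent_itemsets)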
FP-GROWTH ALGORITHM
The two primary drawbacks of the Apriori Algorithm are:
1. At each step, candidate sets have to be built.
2. To build the candidate sets, the algorithm has to repeatedly scan the database.
These two properties inevitably make the algorithm slower. To overcome these redundant steps, a new
association-rule mining algorithm was developed, named the Frequent Pattern Growth Algorithm. It
overcomes the disadvantages of the Apriori algorithm by storing all the transactions in a trie-like data
structure. Consider the following data:
The given data is a hypothetical dataset of transactions, with each letter representing an item. The
frequency of each individual item is computed.
Let the minimum support be 3. A Frequent Pattern set is built which will contain all the elements whose
frequency is greater than or equal to the minimum support. These elements are stored in descending
order of their respective frequencies. After insertion of the relevant items, the set L looks like this:
L = { K : 5, E : 4, M : 3, O : 3, Y : 3 }
Now, for each transaction, the respective Ordered-Item set is built. It is done by iterating the Frequent
Pattern set and checking if the current item is contained in the transaction in question. If the current
item is contained, the item is inserted in the Ordered-Item set for the current transaction. The following
table is built for all the transactions:
Now, all the Ordered-Item sets are inserted into a trie data structure (the FP-tree).
a) Inserting the first Ordered-Item set:
Here, all the items are simply linked one after the other in the order of occurrence in the set, and
the support count of each item is initialized to 1.
b) Inserting the set {K, E, O, Y}:
Till the insertion of the elements K and E, the support count is simply increased by 1. On inserting O we
can see that there is no direct link between E and O, therefore a new node for the item O is initialized
with the support count as 1 and item E is linked to this new node. On inserting Y, we first initialize a new
node for the item Y with support count as 1 and link the new node of O with the new node of Y.
c) Inserting the set {K, E, M}:
Here, the support count of each element (K, E, and M) along the existing path is simply increased by 1.
d) Inserting the set {K, M, Y}:
Similar to step b), first the support count of K is increased, then new nodes for M and Y are initialized
and linked accordingly.
e) Inserting the set {K, E, O}:
Here, simply the support counts of the respective elements are increased. Note that the support count of
the new node of item O is increased.
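The insertion procedure described in steps a) through e) can be sketched in a few lines of Python. The ordered-item sets below are the ones referenced in the walkthrough; the first set, which the walkthrough does not name explicitly, is assumed to be {K, E, M, O, Y}:

class FPNode:
    """A node in the FP-tree: an item, a support count, and child links."""
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}          # item -> FPNode

def insert(root, ordered_items):
    """Insert one ordered-item set, reusing existing prefix paths."""
    node = root
    for item in ordered_items:
        if item not in node.children:                 # no direct link yet:
            node.children[item] = FPNode(item, node)  # create a new node
        node = node.children[item]
        node.count += 1                               # shared prefix: just count up
    return root

def dump(node, depth=0):
    """Print the tree, one 'item:count' per node, indented by depth."""
    for child in node.children.values():
        print("  " * depth + f"{child.item}:{child.count}")
        dump(child, depth + 1)

# Ordered-item sets from the walkthrough above (the first one is assumed).
ordered_sets = [
    ["K", "E", "M", "O", "Y"],   # a)
    ["K", "E", "O", "Y"],        # b)
    ["K", "E", "M"],             # c)
    ["K", "M", "Y"],             # d)
    ["K", "E", "O"],             # e)
]
root = FPNode(None)
for s in ordered_sets:
    insert(root, s)
dump(root)   # K:5 -> E:4 -> (M:2 -> O:1 -> Y:1, O:2 -> Y:1), and K:5 -> M:1 -> Y:1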
Now, for each item, the Conditional Pattern Base is computed, which consists of the path labels of all the paths that
lead to any node of the given item in the frequent-pattern tree. Note that the items in the table below
are arranged in ascending order of their frequencies.
Now, for each item, the Conditional Frequent Pattern Tree is built. It is done by taking the set of
elements that are common to all the paths in the Conditional Pattern Base of that item and calculating the
support count by summing the support counts of all the paths in the Conditional Pattern Base.
From the Conditional Frequent Pattern Tree, the Frequent Pattern rules are generated by pairing the
items of the Conditional Frequent Pattern Tree set with the corresponding item, as given in the
table below.
For each row, two types of association rules can be inferred; for example, from the first row
the rules K -> Y and Y -> K can be inferred. To determine the valid rule, the confidence of
both rules is calculated, and the one with confidence greater than or equal to the minimum
confidence value is retained.