Association Rule Mining
The purchasing of one product when another product is purchased represents an association
rule. Association rules are frequently used by retail stores to assist in marketing, advertising,
floor placement, and inventory control. Although they have direct applicability to retail
businesses, they have been used for other purposes as well, including predicting faults in
telecommunication networks. Association rules are used to show the relationships between
data items. These uncovered relationships are not inherent in the data, as with functional
dependencies, and they do not represent any sort of causality or correlation. Instead,
association rules detect common usage of items. Example 2.1 illustrates this.
Example 2.1
A grocery store chain keeps a record of weekly transactions where each transaction represents
the items bought during one cash register transaction. The executives of the chain receive a
summarized report of the transactions indicating what types of items have sold at what
quantity. In addition, they periodically request information about what items are commonly
purchased together. They find that 100% of the time that Peanut-Butter is purchased, so is
Bread. In addition, 33.3% of the time Peanut-Butter is purchased, Jelly is also purchased.
However, Peanut-Butter appears in only 60% of the overall transactions.
The table below shows one such sample of five transactions over five items {Beer, Bread, Jelly, Milk, Peanut-Butter}, consistent with the figures above:

Transaction   Items
t1            Bread, Jelly, Peanut-Butter
t2            Bread, Peanut-Butter
t3            Bread, Milk, Peanut-Butter
t4            Beer, Bread
t5            Beer, Milk

Association rule mining can be applied to this dataset to perform a market basket analysis, i.e., to determine which items are frequently bought together.
Support measures how frequently an itemset appears in the data: it is the proportion of all
transactions that contain the itemset. Confidence, on the other hand, measures the strength of
the relationship between the antecedent and consequent of a rule. It is defined as the
proportion of transactions that contain both the antecedent and consequent items, out of the
transactions that contain the antecedent item. In other words, it measures the conditional
probability that the consequent item will occur in a transaction given that the antecedent item
is already present in that transaction. For example, a confidence of 0.8 for the rule X --> Y
means that 80% of the transactions that contain item X also contain item Y.
Minimum support and minimum confidence are user-specified thresholds used to filter out
itemsets and association rules that are deemed too rare or too weak. For example, if the
minimum support threshold is set to 0.1, only itemsets that appear in at least 10% of the
transactions are considered frequent. Similarly, if the minimum confidence threshold for a
rule X --> Y is set to 0.7, only rules for which at least 70% of the transactions containing X
also contain Y are considered valid.
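For illustration, both measures can be computed directly in Python. The sketch below uses a small transaction list mirroring the grocery example; the exact transactions are an assumption for demonstration, not taken verbatim from the text.

```python
# Illustrative only: a hypothetical five-transaction basket dataset.
transactions = [
    {"Bread", "Jelly", "PeanutButter"},
    {"Bread", "PeanutButter"},
    {"Bread", "Milk", "PeanutButter"},
    {"Beer", "Bread"},
    {"Beer", "Milk"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """P(consequent in basket | antecedent in basket)."""
    return (support(set(antecedent) | set(consequent), transactions)
            / support(antecedent, transactions))

print(support({"PeanutButter"}, transactions))                # 0.6
print(confidence({"PeanutButter"}, {"Bread"}, transactions))  # 1.0
print(confidence({"PeanutButter"}, {"Jelly"}, transactions))  # ~0.333
```

Note how the printed values reproduce the figures from Example 2.1: every Peanut-Butter purchase includes Bread (confidence 1.0), and one third of them include Jelly.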
Apriori Algorithm
The Apriori algorithm is a classic and widely used data mining technique for discovering
frequent itemsets in transactional datasets. It's primarily used in market basket analysis, which
involves finding associations between items that are often purchased together. This algorithm
helps retailers and businesses understand customer purchasing patterns and make decisions
related to product placement, promotion strategies, and more.
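As an illustration of the idea, here is a minimal pure-Python sketch of the Apriori loop: count candidates, keep the frequent ones, then join and prune to form the next level. The dataset and `min_support` value are hypothetical, chosen only for demonstration.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Minimal Apriori sketch: returns {frozenset: support} for all
    itemsets whose support (as a fraction) meets min_support."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    current = [frozenset([i]) for i in items]  # candidate 1-itemsets
    frequent = {}
    k = 1
    while current:
        # Count each candidate's support in one pass over the data.
        counts = {c: sum(c <= t for t in transactions) for c in current}
        survivors = {c: v / n for c, v in counts.items() if v / n >= min_support}
        frequent.update(survivors)
        # Join step, then prune by the Apriori property: every k-subset
        # of a (k+1)-candidate must itself be frequent.
        keys = list(survivors)
        k += 1
        candidates = set()
        for a, b in combinations(keys, 2):
            union = a | b
            if len(union) == k and all(frozenset(s) in survivors
                                       for s in combinations(union, k - 1)):
                candidates.add(union)
        current = sorted(candidates, key=sorted)
    return frequent

transactions = [{"bread", "milk"}, {"bread", "butter"},
                {"bread", "milk", "butter"}, {"milk"}]
freq = apriori(transactions, min_support=0.5)
# freq holds {bread}, {butter}, {milk}, {bread, butter}, {bread, milk}
```

The prune step is what makes Apriori practical: a candidate is never counted if any of its subsets already failed the support threshold.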
To practice association rule mining using the Apriori algorithm and Python, follow
the link below.
https://youtu.be/r-vymRtEzN8
FP-GROWTH ALGORITHM
The FP-Growth algorithm is a popular association rule mining algorithm that overcomes some
of the limitations of Apriori: it finds frequent itemsets in large databases without generating
candidate itemsets. It uses a compact data structure called the FP-tree and requires far fewer
scans of the transaction database, making it more efficient than Apriori, especially for large
datasets.
The FP-Growth algorithm works by first creating a frequent pattern tree (FP tree) from the
database. The FP tree is a compact representation of the database that stores the itemsets that
occur in the database and their frequencies. The FP tree is then scanned to find all of the
frequent itemsets.
Once the frequent itemsets have been found, they can be used to generate association rules.
The FP-Growth algorithm proceeds in three steps:
1. Building the FP-tree:
• Scan the transaction database to calculate the frequency of each item and identify
frequent 1-itemsets.
• Sort the items in descending order of their frequency.
• Construct the FP-tree by inserting each transaction into the tree, maintaining the
order of items based on their frequency. Each path from the root to a leaf node
represents a transaction, and nodes represent items.
2. Mining frequent itemsets:
• Start with the least frequent item in the tree and grow the conditional FP-tree
for each item.
• For each item, create a conditional pattern base by collecting the paths in the
FP- tree that contain that item.
• From the conditional pattern base, construct a conditional FP-tree recursively.
• Mine frequent itemsets from the conditional FP-tree using a depth-first search
approach.
3. Generating association rules:
• For each frequent itemset, generate association rules by recursively splitting the
itemset into antecedent and consequent.
• Calculate the confidence of each rule, which is the support of the itemset divided
by the support of the antecedent.
• Prune the rules that do not meet the minimum confidence threshold.
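Step 1 above (building the FP-tree) can be sketched in Python as follows. This is a simplified, illustrative implementation: a full FP-Growth implementation would also maintain a header table with node links for the mining phase, which is omitted here.

```python
from collections import Counter

class Node:
    """One FP-tree node: an item, a count, and children keyed by item."""
    def __init__(self, item):
        self.item, self.count, self.children = item, 0, {}

def build_fp_tree(transactions, min_count=1):
    # Pass 1: count item frequencies and drop infrequent items.
    freq = Counter(i for t in transactions for i in t)
    freq = {i: c for i, c in freq.items() if c >= min_count}
    root = Node(None)
    # Pass 2: insert each transaction with items sorted by descending
    # frequency (ties broken alphabetically) so shared prefixes merge.
    for t in transactions:
        items = sorted((i for i in t if i in freq),
                       key=lambda i: (-freq[i], i))
        node = root
        for item in items:
            node = node.children.setdefault(item, Node(item))
            node.count += 1
    return root, freq

def dump(node, depth=0):
    """Print the tree in the (item, count) notation used below."""
    for child in node.children.values():
        print("  " * depth + f"({child.item}, {child.count})")
        dump(child, depth + 1)

# The nine transactions from the worked example below.
transactions = [set("ABCD"), set("ACD"), set("ABD"), set("AD"),
                set("BD"), set("BCD"), set("AC"), set("BC"), set("CD")]
root, freq = build_fp_tree(transactions)
dump(root)
```

Calling `dump(root)` prints the tree one node per line, indented by depth, which matches the hand-drawn tree in the example that follows.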
Transaction 1: {A, B, C, D}
Transaction 2: {A, C, D}
Transaction 3: {A, B, D}
Transaction 4: {A, D}
Transaction 5: {B, D}
Transaction 6: {B, C, D}
Transaction 7: {A, C}
Transaction 8: {B, C}
Transaction 9: {C, D}
First, we scan the database to calculate the frequency of each item. The frequencies are as
follows:
A: 5
B: 5
C: 6
D: 7
null
 ├── (D, 7)
 │    ├── (C, 4)
 │    │    ├── (A, 2)
 │    │    │    └── (B, 1)
 │    │    └── (B, 1)
 │    ├── (A, 2)
 │    │    └── (B, 1)
 │    └── (B, 1)
 └── (C, 2)
      ├── (A, 1)
      └── (B, 1)
The numbers in parentheses represent the count of each node in the tree. Items within each
transaction are inserted in descending order of frequency (D, C, A, B; ties broken
alphabetically), so transactions sharing a prefix share a path.
Starting with the least frequent items, we grow a conditional FP-tree for each item. Assume a
minimum support count of 2.
For item B (count 5):
- Create a conditional pattern base for B from its prefix paths, weighted by B's counts:
{(D, C, A): 1, (D, C): 1, (D, A): 1, (D): 1, (C): 1}
- Item counts within this base are D: 4, C: 3, A: 2, so all three items survive the minimum
support count. The conditional FP-tree for B:
null
 ├── (D, 4)
 │    ├── (C, 2)
 │    │    └── (A, 1)
 │    └── (A, 1)
 └── (C, 1)
- Mining this conditional FP-tree yields the frequent itemsets {B}, {B, D}: 4, {B, C}: 3,
{B, A}: 2, {B, C, D}: 2, and {B, A, D}: 2.
For item C (count 6):
- Create a conditional pattern base for C: {(D): 4}. (The other two occurrences of C sit
directly under the root and contribute no prefix.)
- The conditional FP-tree for C is a single branch:
null
 └── (D, 4)
- Mining this conditional FP-tree yields the frequent itemsets {C} and {C, D}: 4.
For item A (count 5):
- Create a conditional pattern base for A: {(D, C): 2, (D): 2, (C): 1}
- Item counts within this base are D: 4 and C: 3. The conditional FP-tree for A:
null
 ├── (D, 4)
 │    └── (C, 2)
 └── (C, 1)
- Mining this conditional FP-tree yields the frequent itemsets {A}, {A, D}: 4, {A, C}: 3,
and {A, C, D}: 2.
Item D has an empty conditional pattern base, so it contributes only the singleton {D}: 7.
Using the frequent itemsets obtained from the conditional FP-trees, we can generate
association rules. Let's consider a minimum confidence threshold of 50%.
From the frequent itemset {B, C, D}, we can generate the following association rules:
- {B, C} -> {D}
- {B, D} -> {C}
- {C, D} -> {B}
- {B} -> {C, D}
- {C} -> {B, D}
- {D} -> {B, C}
Calculating the confidence of each rule and pruning those below the 50% threshold:
- {B, C} -> {D}: confidence = supp({B, C, D}) / supp({B, C}) = 2/3 ≈ 66.7% — valid
- {B, D} -> {C}: 2/4 = 50% — valid
- {C, D} -> {B}: 2/4 = 50% — valid
- {B} -> {C, D}: 2/5 = 40% — pruned
- {C} -> {B, D}: 2/6 ≈ 33.3% — pruned
- {D} -> {B, C}: 2/7 ≈ 28.6% — pruned
For example, the confidence of 66.7% for {B, C} -> {D} indicates that two of the three
transactions containing both B and C also contain D, so this rule is considered valid.
By applying the FP-Growth algorithm to the given transaction database, we have successfully
mined frequent itemsets and generated association rules.
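As a sanity check, the rule confidences can be recomputed by brute force directly from the nine transactions. The snippet below is illustrative Python, not part of the original example.

```python
# The nine transactions from the worked example, as sets of items.
transactions = [set(t) for t in
                ["ABCD", "ACD", "ABD", "AD", "BD", "BCD", "AC", "BC", "CD"]]

def count(itemset):
    """Number of transactions containing every item in `itemset`."""
    return sum(set(itemset) <= t for t in transactions)

def confidence(antecedent, consequent):
    """supp(antecedent ∪ consequent) / supp(antecedent)."""
    return count(set(antecedent) | set(consequent)) / count(antecedent)

print(confidence("BC", "D"))  # 2/3 ≈ 0.667 -> valid at a 50% threshold
print(confidence("BD", "C"))  # 2/4 = 0.5   -> valid
print(confidence("D", "BC"))  # 2/7 ≈ 0.286 -> pruned
```

Brute-force counting like this is only feasible for tiny datasets, which is exactly why FP-Growth builds the FP-tree instead of scanning the database for every candidate.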
Note: The example provided is simplified for demonstration purposes. In practice, FP-
Growth can handle much larger datasets and more complex transaction structures efficiently.
ISSUES, GAPS AND CHALLENGES OF ASSOCIATION RULE MINING
In research, there are several common issues, gaps, and challenges that can affect association rule
mining. These include:
1. Scalability: Association rule mining algorithms can face scalability challenges when
dealing with large datasets or high-dimensional data. Mining frequent itemsets and
generating rules from such datasets can be computationally expensive and time-
consuming.
2. Parameter Selection: Association rule mining algorithms often require the selection of
various parameters, such as minimum support and minimum confidence thresholds.
Determining optimal parameter values can be challenging, as it depends on the specific
dataset and the goals of the analysis. Inaccurate parameter selection may lead to the
generation of irrelevant or uninteresting rules.
3. Rule Quality and Interestingness: Assessing the quality and interestingness of generated
rules is a subjective task. Defining appropriate measures and criteria to evaluate rule
quality, interestingness, and significance is an ongoing research challenge. Researchers
strive to develop measures that can capture meaningful and valuable patterns while
filtering out noisy or spurious rules.
4. Handling Sparse Data: Association rule mining algorithms often struggle with sparse
datasets, where most of the item combinations do not occur frequently. Sparse data
can result in low support values and may lead to the generation of weak or
uninformative rules. Developing techniques to handle sparse data and extract reliable
rules is an ongoing challenge.
5. Real-time and Streaming Data: Traditional association rule mining algorithms are
designed for static datasets. However, in real-time or streaming scenarios, new data
arrives continuously, requiring dynamic and incremental mining approaches.
Developing algorithms that can handle streaming data and provide timely insights is an
active area of research.