Association Rule Mining


SUB TOPIC: ASSOCIATION RULE MINING

The purchasing of one product when another product is purchased represents an association
rule. Association rules are frequently used by retail stores to assist in marketing, advertising,
floor placement, and inventory control. Although they have direct applicability to retail
businesses, they have been used for other purposes as well, including predicting faults in
telecommunication networks. Association rules are used to show the relationships between
data items. These uncovered relationships are not inherent in the data, as is the case with functional dependencies, and they do not represent any sort of causality or correlation. Instead, association rules detect common usage of items. Example 2.1 illustrates this.

Example 2.1
A grocery store chain keeps a record of weekly transactions where each transaction represents
the items bought during one cash register transaction. The executives of the chain receive a
summarized report of the transactions indicating what types of items have sold at what
quantity. In addition, they periodically request information about what items are commonly
purchased together. They find that 100% of the time that Peanut-Butter is purchased, so is
Bread. In addition, 33.3% of the time Peanut-Butter is purchased, Jelly is also purchased.
However, Peanut-Butter exists in only about 50% of the overall transactions.

An association rule is an expression of the form X --> Y, where X is a set of items and Y is a single item. Association rule methods are an initial data exploration approach that is often applied to extremely large data sets.

Consider a sample of five transactions over five items: {Beer, Bread, Jelly, Milk, Peanut-Butter}.

Association rule mining can be applied to such a dataset to perform market basket analysis, i.e., determining which items are frequently bought together.

Two important measures in association rule mining are support and confidence.


Support measures the frequency with which a given itemset or relationship appears in the dataset. For a rule X --> Y, it is calculated by dividing the number of transactions that contain both the antecedent and the consequent by the total number of transactions. In other words, it measures the degree to which the relationship exists in the data. For example, if a support threshold of 0.05 is set, then any itemset {X, Y} that appears in at least 5% of the transactions is considered a frequent itemset.

Confidence, on the other hand, measures the strength of the relationship between the antecedent and consequent items: confidence(X --> Y) = support({X, Y}) / support({X}). It is the proportion of transactions containing the antecedent that also contain the consequent, i.e., the conditional probability that the consequent occurs in a transaction given that the antecedent is present. For example, a confidence of 0.8 for the rule X --> Y means that 80% of the transactions that contain item X also contain item Y.

Minimum support and minimum confidence are also important metrics. These are user-specified thresholds used to filter out frequent itemsets or association rules that are deemed too rare or too weak. For example, if the minimum support threshold is set to 0.1, only itemsets that appear in at least 10% of the transactions will be considered frequent. Similarly, if the minimum confidence threshold for a rule X --> Y is set to 0.7, only rules for which at least 70% of the transactions containing X also contain Y will be considered valid.
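
The sketch below illustrates both measures and the threshold check in Python. The five transactions are invented for illustration (they echo the Peanut-Butter and Bread pattern from Example 2.1), and the threshold values are arbitrary choices:

# Hypothetical transactions for illustration only.
transactions = [
    {"Bread", "Jelly", "Peanut-Butter"},
    {"Bread", "Peanut-Butter"},
    {"Bread", "Milk", "Peanut-Butter"},
    {"Beer", "Bread"},
    {"Beer", "Milk"},
]

def support(itemset):
    # Fraction of all transactions that contain every item in `itemset`.
    return sum(1 for t in transactions if set(itemset) <= t) / len(transactions)

def confidence(antecedent, consequent):
    # Conditional probability of the consequent given the antecedent.
    return support(set(antecedent) | set(consequent)) / support(antecedent)

MIN_SUPPORT, MIN_CONFIDENCE = 0.3, 0.7   # assumed thresholds for this sketch

s = support({"Peanut-Butter", "Bread"})
c = confidence({"Peanut-Butter"}, {"Bread"})
print(f"support = {s:.2f}, confidence = {c:.2f}")   # support = 0.60, confidence = 1.00
print("rule kept:", s >= MIN_SUPPORT and c >= MIN_CONFIDENCE)

Here the rule Peanut-Butter --> Bread passes both thresholds: it appears in 60% of these transactions, and every transaction containing Peanut-Butter also contains Bread.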

Association Rule Mining Algorithms


1. Apriori Algorithm
2. FP-Growth Algorithm

Apriori Algorithm
The Apriori algorithm is a classic and widely used data mining technique for discovering
frequent itemsets in transactional datasets. It's primarily used in market basket analysis, which
involves finding associations between items that are often purchased together. This algorithm
helps retailers and businesses understand customer purchasing patterns and make decisions
related to product placement, promotion strategies, and more.

Here's how the Apriori algorithm works:

1) Support and Confidence:
- Support: The support of an itemset is the proportion of transactions in the dataset that contain that itemset. It indicates how frequently the itemset appears.
- Confidence: Confidence measures the likelihood that an item Y is purchased when item X is purchased. It is calculated as the support of the itemset {X, Y} divided by the support of itemset X.
2) Minimum Support and Confidence: Before running the algorithm, you need to set
minimum support and confidence thresholds. These thresholds determine which itemsets are
considered significant. Items or itemsets that don't meet these thresholds are filtered out.
3) Generating Candidate Itemsets: The algorithm starts by scanning the dataset to determine
the support of individual items (1-item itemsets). Items that meet the minimum support
threshold are considered frequent 1-item itemsets. These frequent itemsets will serve as the
basis for generating larger itemsets.
4) Generating Larger Itemsets: The algorithm iterates through the process of generating larger
itemsets by combining frequent (k-1)-item itemsets to form k-item candidate itemsets. These
candidates are generated by joining two (k-1)-item itemsets if their first k-2 items are the same.
Then, it prunes any candidate itemsets that have subsets that are not frequent.
5) Pruning Step: During the generation of larger itemsets, some candidates are pruned if they
contain subsets that are not frequent. This is based on the "Apriori property," which states that
if an itemset is infrequent, all its supersets will also be infrequent.
6) Counting Support and Repeating: After generating candidate itemsets, the algorithm scans
the dataset again to count the actual support of each candidate itemset. Candidate itemsets
that meet the minimum support threshold become frequent k-item itemsets, and the process
continues with the next iteration to generate larger (k+1)-item candidates.
7) Association Rule Generation: Once all frequent itemsets are discovered, the algorithm derives association rules from them by splitting each frequent itemset into an antecedent and a consequent and evaluating every such split. Rules are then evaluated based on their confidence, and those meeting the minimum confidence threshold are considered significant. (A minimal Python sketch of steps 3 to 6 follows.)
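
To make steps 3 to 6 concrete, here is a minimal, unoptimized Apriori sketch in plain Python. The nine-transaction database and the minimum support of 3/9 are assumptions chosen so the results can be cross-checked against the FP-Growth implementation example later in this section:

from itertools import combinations

# Hypothetical transaction database and threshold for this sketch.
transactions = [
    {"A", "B", "C", "D"}, {"A", "C", "D"}, {"A", "B", "D"}, {"A", "D"},
    {"B", "D"}, {"B", "C", "D"}, {"A", "C"}, {"B", "C"}, {"C", "D"},
]
MIN_SUPPORT = 3 / 9
n = len(transactions)

def support_count(itemset):
    # Number of transactions that contain every item of `itemset`.
    return sum(1 for t in transactions if itemset <= t)

# Step 3: frequent 1-itemsets.
items = {i for t in transactions for i in t}
levels = [{frozenset([i]) for i in items if support_count({i}) / n >= MIN_SUPPORT}]

# Steps 4 to 6: join frequent (k-1)-itemsets into k-item candidates, prune
# candidates that have an infrequent subset (the Apriori property), then
# count support and keep the candidates that meet the threshold.
k = 1
while levels[-1]:
    prev = levels[-1]
    candidates = {a | b for a in prev for b in prev if len(a | b) == k + 1}
    candidates = {c for c in candidates
                  if all(frozenset(s) in prev for s in combinations(c, k))}
    levels.append({c for c in candidates if support_count(c) / n >= MIN_SUPPORT})
    k += 1

print(sorted(tuple(sorted(s)) for level in levels for s in level))

With these inputs the sketch reports the four singletons plus {A, C}, {A, D}, {B, C}, {B, D}, and {C, D} as frequent, which matches the FP-Growth walk-through below.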

To practice association rule mining using the Apriori algorithm and Python programming, follow the link below.
https://youtu.be/r-vymRtEzN8

FP-GROWTH ALGORITHM

The FP-Growth algorithm is an association rule mining algorithm used to find frequent itemsets in a large database. It is a more efficient alternative to the Apriori algorithm because it does not require the generation of candidate itemsets. Instead, it uses a compact data structure called the FP-tree to mine frequent itemsets and generate association rules, and it reduces the number of scans of the transaction database, making it especially effective for large datasets.

The FP-Growth algorithm works by first creating a frequent pattern tree (FP-tree) from the database. The FP-tree is a compact representation of the database that stores the itemsets occurring in the database together with their frequencies. The FP-tree is then mined to find all of the frequent itemsets, which in turn are used to generate association rules. This makes FP-Growth a powerful tool, particularly for large databases.

FP-Growth algorithm:
1. Building the FP-tree:
• Scan the transaction database to calculate the frequency of each item and identify frequent 1-itemsets.
• Sort the items in descending order of their frequency.
• Construct the FP-tree by inserting each transaction into the tree, maintaining the order of items based on their frequency. Each node represents an item, and each transaction corresponds to a path from the root, with common prefixes shared between transactions. (A minimal sketch of this step in Python follows the list.)
2. Mining frequent itemsets:
• Start with the least frequent item in the tree and grow the conditional FP-tree for each item.
• For each item, create a conditional pattern base by collecting the paths in the FP-tree that contain that item.
• From the conditional pattern base, construct a conditional FP-tree recursively.
• Mine frequent itemsets from the conditional FP-tree using a depth-first search approach.
3. Generating association rules:
• For each frequent itemset, generate association rules by recursively splitting the itemset into antecedent and consequent.
• Calculate the confidence of each rule, which is the support of the itemset divided by the support of the antecedent.
• Prune the rules that do not meet the minimum confidence threshold.
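
As a rough illustration of step 1, the sketch below builds an FP-tree in plain Python and prints it. The FPNode class, the alphabetical tie-breaking between equally frequent items, and the sample data (the nine transactions from the implementation example below) are choices made for this sketch rather than any standard API:

from collections import Counter

class FPNode:
    # One FP-tree node: an item label, a count, and children keyed by item.
    def __init__(self, item):
        self.item = item
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_count):
    # Pass 1: count item frequencies and keep only the frequent items.
    counts = Counter(i for t in transactions for i in t)
    keep = {i for i, c in counts.items() if c >= min_count}
    root = FPNode(None)
    # Pass 2: insert each transaction with its items sorted by descending
    # frequency (ties broken alphabetically); shared prefixes share nodes.
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in keep),
                           key=lambda i: (-counts[i], i)):
            node = node.children.setdefault(item, FPNode(item))
            node.count += 1
    return root

def show(node, depth=0):
    # Print one "(item, count)" per line, indented by tree depth.
    if node.item is not None:
        print("  " * (depth - 1) + f"({node.item}, {node.count})")
    for child in node.children.values():
        show(child, depth + 1)

transactions = [list("ABCD"), list("ACD"), list("ABD"), list("AD"),
                list("BD"), list("BCD"), list("AC"), list("BC"), list("CD")]
show(build_fp_tree(transactions, min_count=3))

The printed tree matches the FP-tree drawn in the implementation example at the end of this section.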

The advantages of the FP-Growth algorithm:


• Reduced database scans: The FP-Growth algorithm requires only two passes over
the transaction database, making it more efficient for large datasets compared to the
multiple scans needed by Apriori.
• Compact data structure: The FP-tree data structure allows for efficient storage of
transaction data and reduces the memory requirement compared to the Apriori
algorithm.
• No candidate generation: Unlike Apriori, the FP-Growth algorithm does not generate
candidate itemsets, which can be computationally expensive. Instead, it uses the FP-tree
structure to directly extract frequent itemsets.
• Efficient mining of frequent itemsets: By using the FP-tree and conditional pattern base,
the FP-Growth algorithm can efficiently mine frequent itemsets without generating all
possible combinations, leading to improved performance.

Some of the disadvantages of the FP-Growth algorithm:


• It can be difficult to understand and implement.
• It can be sensitive to the minimum support threshold.
• The FP-Growth algorithm may have limitations in handling datasets with high-dimensional data or very long patterns, as these may result in a large number of nodes in the FP-tree and conditional FP-trees, leading to increased memory usage and computational complexity.

FP-GROWTH IMPLEMENTATION EXAMPLE

Consider the following transaction database:

Transaction 1: {A, B, C, D}
Transaction 2: {A, C, D}
Transaction 3: {A, B, D}
Transaction 4: {A, D}
Transaction 5: {B, D}
Transaction 6: {B, C, D}
Transaction 7: {A, C}
Transaction 8: {B, C}
Transaction 9: {C, D}

Step 1: Building the FP-tree

First, we scan the database to calculate the frequency of each item. Counting occurrences in the nine transactions above gives:

A: 5
B: 5
C: 6
D: 7

Based on the frequencies, we sort the items in descending order (ties broken alphabetically): D, C, A, B.

Each transaction is rewritten in this order and inserted into the tree, with transactions that share a prefix sharing the corresponding nodes. The resulting FP-tree looks like this:

null
├── (D, 7)
│   ├── (C, 4)
│   │   ├── (A, 2)
│   │   │   └── (B, 1)
│   │   └── (B, 1)
│   ├── (A, 2)
│   │   └── (B, 1)
│   └── (B, 1)
└── (C, 2)
    ├── (A, 1)
    └── (B, 1)

The numbers in parentheses represent the count of each node, i.e., how many transactions pass through it.

Step 2: Mining frequent itemsets

Assume a minimum support count of 3 (one-third of the nine transactions). Starting with the last item in the frequency order, we build a conditional pattern base and a conditional FP-tree for each item.

For item B (count 5):
- The conditional pattern base for B consists of the prefix paths of every B node, each carrying that node's count: {D, C, A}: 1, {D, C}: 1, {D, A}: 1, {D}: 1, {C}: 1.
- Within this base, D occurs 4 times and C occurs 3 times, while A occurs only twice and is discarded. The conditional FP-tree for B is:

null
├── (D, 4)
│   └── (C, 2)
└── (C, 1)

- Mining this tree yields the frequent itemsets {B}: 5, {B, D}: 4, and {B, C}: 3. No larger itemset containing B reaches the support threshold.

For item A (count 5):
- The conditional pattern base for A is: {D, C}: 2, {D}: 2, {C}: 1.
- D occurs 4 times and C occurs 3 times in the base, so both are kept. The conditional FP-tree for A is:

null
├── (D, 4)
│   └── (C, 2)
└── (C, 1)

- Mining this tree yields the frequent itemsets {A}: 5, {A, D}: 4, and {A, C}: 3. The candidate {A, C, D} occurs only twice, so it is not frequent.

For item C (count 6):
- The conditional pattern base for C is: {D}: 4.
- The conditional FP-tree for C is a single branch: null ── (D, 4).
- Mining this tree yields the frequent itemsets {C}: 6 and {C, D}: 4.

Item D (count 7) is the first item in the order, so it has no conditional pattern base and contributes only {D}: 7.

Altogether, the frequent itemsets are {A}, {B}, {C}, {D}, {A, C}, {A, D}, {B, C}, {B, D}, and {C, D}.

Step 3: Generating association rules

Using the frequent itemsets obtained above, we can generate association rules. Let's consider a minimum confidence threshold of 50%. Each frequent 2-itemset yields two candidate rules, and the confidence of a rule is the support of the whole itemset divided by the support of its antecedent:

- {A} -> {D}: 4/5 = 80%
- {D} -> {A}: 4/7 ≈ 57%
- {B} -> {D}: 4/5 = 80%
- {D} -> {B}: 4/7 ≈ 57%
- {C} -> {D}: 4/6 ≈ 67%
- {D} -> {C}: 4/7 ≈ 57%
- {A} -> {C}: 3/5 = 60%
- {C} -> {A}: 3/6 = 50%
- {B} -> {C}: 3/5 = 60%
- {C} -> {B}: 3/6 = 50%

All ten rules meet the 50% minimum confidence threshold, so none are pruned. With a stricter threshold of, say, 75%, only {A} -> {D} and {B} -> {D} would survive. For example, the confidence of 80% for {B} -> {D} means that 80% of the transactions containing B also contain D.

By applying the FP-Growth algorithm to the given transaction database, we have mined the frequent itemsets and generated the association rules.

Note: The example provided is simplified for demonstration purposes. In practice, FP-Growth can handle much larger datasets and more complex transaction structures efficiently.
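
As a cross-check, the same result can be reproduced with an off-the-shelf implementation. The sketch below assumes the third-party mlxtend library (and pandas) is installed; it one-hot encodes the nine transactions and mines them with the same thresholds used above:

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth, association_rules

transactions = [list("ABCD"), list("ACD"), list("ABD"), list("AD"),
                list("BD"), list("BCD"), list("AC"), list("BC"), list("CD")]

# One-hot encode the transactions into a boolean DataFrame.
te = TransactionEncoder()
df = pd.DataFrame(te.fit_transform(transactions), columns=te.columns_)

# Frequent itemsets at the minimum support count of 3 used above (3 of 9).
freq = fpgrowth(df, min_support=3 / 9, use_colnames=True)
print(freq.sort_values("support", ascending=False))

# Association rules at the 50% minimum confidence used in Step 3.
rules = association_rules(freq, metric="confidence", min_threshold=0.5)
print(rules[["antecedents", "consequents", "support", "confidence"]])
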
ISSUES, GAPS AND CHALLENGES OF ASSOCIATION RULE MINING
In research, there are several common issues, gaps, and challenges that can affect association rule
mining. These include:

1. Scalability: Association rule mining algorithms can face scalability challenges when
dealing with large datasets or high-dimensional data. Mining frequent itemsets and
generating rules from such datasets can be computationally expensive and time-
consuming.

2. Efficiency: Efficiency is a crucial aspect of association rule mining. Many traditional
algorithms have high time and space complexity, making them inefficient for large
datasets. Improving the efficiency of mining algorithms is an ongoing research challenge.

3. Parameter Selection: Association rule mining algorithms often require the selection of
various parameters, such as minimum support and minimum confidence thresholds.
Determining optimal parameter values can be challenging, as it depends on the specific
dataset and the goals of the analysis. Inaccurate parameter selection may lead to the
generation of irrelevant or uninteresting rules.

4. Rule Quality and Interestingness: Assessing the quality and interestingness of generated
rules is a subjective task. Defining appropriate measures and criteria to evaluate rule
quality, interestingness, and significance is an ongoing research challenge. Researchers
strive to develop measures that can capture meaningful and valuable patterns while
filtering out noisy or spurious rules.

5. Handling High-Dimensional Data: Association rule mining algorithms can face
difficulties when dealing with high-dimensional data, where the number of items or
attributes is large. High dimensionality can result in a combinatorial explosion of
itemsets, making it challenging to discover meaningful rules. Developing techniques to
handle high-dimensional data and extract useful patterns is an active area of research.

6. Handling Sparse Data: Association rule mining algorithms often struggle with sparse
datasets, where most of the item combinations do not occur frequently. Sparse data
can result in low support values and may lead to the generation of weak or
uninformative rules. Developing techniques to handle sparse data and extract reliable
rules is an ongoing challenge.

7. Handling Continuous and Heterogeneous Data: Traditional association rule mining
algorithms are designed for categorical or binary data. However, real-world datasets
often contain continuous or heterogeneous attributes. Developing algorithms that can
handle continuous and mixed-type data and effectively mine association rules from such
datasets is an active research area.

8. Incorporating Domain Knowledge: Association rule mining may generate a large
number of rules, including many trivial or uninteresting ones. Incorporating domain
knowledge and constraints into the mining process can help guide the discovery of more
meaningful and actionable rules. Developing techniques to incorporate domain
knowledge effectively is a research challenge.

9. Interpretability and Visualization: Interpreting and visualizing large sets of association
rules can be challenging. Presenting the rules in a concise and understandable manner to
facilitate decision-making and knowledge discovery is an ongoing research focus.

10. Real-time and Streaming Data: Traditional association rule mining algorithms are
designed for static datasets. However, in real-time or streaming scenarios, new data
arrives continuously, requiring dynamic and incremental mining approaches.
Developing algorithms that can handle streaming data and provide timely insights is an
active area of research.
