Unit IV DWDM

Association rule mining is a data mining technique used to discover interesting relationships within large datasets, particularly in transactional databases. Key concepts include support, confidence, and the Apriori algorithm, which efficiently identifies frequent itemsets and generates association rules based on user-defined thresholds. Despite its usefulness in various fields, the Apriori algorithm can be slow and inefficient with large datasets due to the extensive number of candidate itemsets it generates.

ASSOCIATION RULE MINING

Association rule mining is a technique in data mining that aims to discover interesting relationships or
patterns within large datasets. It is particularly useful for uncovering associations between different
variables or items in transactional databases, such as items frequently purchased together in a retail
setting or symptoms co-occurring in medical records.

Here are some key concepts and steps involved in association rule mining:

1. Transaction Database:

• Association rule mining typically starts with a transactional database, where each
transaction represents a set of items. For example, a transaction could be a customer's
shopping basket containing various products.

2. Support:

• Support is a measure of how frequently an itemset (a set of items) appears in the dataset. It is calculated as the number of transactions containing the itemset divided by the total number of transactions. High support indicates that the itemset is common in the dataset.

3. Confidence:

• Confidence measures the reliability of an association rule. It is the probability of finding the consequent (the item you want to predict) in a transaction given that the transaction contains the antecedent (the item(s) used for prediction). High confidence indicates a strong association between the antecedent and the consequent.

4. Association Rules:

• Association rules are typically expressed in the form "IF {antecedent} THEN
{consequent}." For example, "IF {diapers} THEN {baby formula}." These rules are derived
based on the support and confidence thresholds set by the user.

5. Apriori Algorithm:

• The Apriori algorithm is a popular algorithm for association rule mining. It uses a
breadth-first search strategy to discover frequent itemsets and generate association
rules efficiently. The algorithm relies on the Apriori property, which states that if an
itemset is frequent, all of its subsets must also be frequent.

6. Pruning:

• To optimize the process of rule discovery, pruning techniques are often employed to
eliminate itemsets or rules that do not meet certain criteria, such as minimum support
or confidence thresholds.

7. Lift:
• Lift is another measure used in association rule mining. It compares the likelihood of the consequent occurring when the antecedent is present to the likelihood of the consequent occurring in general. A lift value greater than 1 indicates a positive correlation between the antecedent and the consequent (a short sketch of computing these three measures follows this list).
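
These three measures can be computed directly from transaction counts. Below is a minimal Python sketch (the item names and the small transaction list are made up for illustration) showing how support, confidence, and lift relate to one another:

# Hypothetical transactions, for illustration only
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "milk"},
    {"bread"},
]
n = len(transactions)

def support(itemset):
    # Fraction of transactions that contain every item in `itemset`
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / n

def confidence(antecedent, consequent):
    # Estimated P(consequent | antecedent)
    return support(set(antecedent) | set(consequent)) / support(antecedent)

def lift(antecedent, consequent):
    # Confidence divided by the baseline support of the consequent
    return confidence(antecedent, consequent) / support(consequent)

print(support({"bread", "butter"}))       # 2/5 = 0.4
print(confidence({"bread"}, {"butter"}))  # 0.4/0.8 = 0.5
print(lift({"bread"}, {"butter"}))        # 0.5/0.6 ≈ 0.83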

Association rule mining is widely used in various fields, including retail, healthcare, and finance, to
uncover hidden patterns in large datasets. However, it's essential to interpret the results carefully, as
associations do not imply causation, and some discovered rules may be spurious or coincidental.

FREQUENT ITEMSET GENERATION

Frequent itemset generation is a crucial step in association rule mining, particularly in algorithms like
Apriori. The goal is to identify sets of items that occur together frequently in a dataset. Here's an
overview of the process:

1. Support Count:

• The support count of an itemset is the number of transactions in which the itemset
appears. It is the basis for determining the frequency of itemsets. For example, if you're
working with a retail dataset, the support count of an itemset {A, B} is the number of
transactions containing both items A and B.

2. Support Threshold:

• The support threshold is a user-defined parameter that sets the minimum support count
or percentage required for an itemset to be considered "frequent." Itemsets that meet
or exceed this threshold are considered candidates for further analysis.

3. Apriori Algorithm:

• The Apriori algorithm is a classic algorithm for generating frequent itemsets. It uses a
level-wise, breadth-first search strategy to discover frequent itemsets of increasing size.
The algorithm relies on the Apriori property, which states that if an itemset is frequent,
all of its subsets must also be frequent.

4. Algorithm Steps:

• Here are the basic steps of the Apriori algorithm:

• Initialization: Identify frequent 1-itemsets (single items) by scanning the database and counting their support.

• Iteration: Generate candidate itemsets of size k based on frequent itemsets of size k-1. Prune candidate itemsets that have infrequent subsets.

• Counting: Count the support of each candidate itemset by scanning the database.

• Pruning: Eliminate candidate itemsets that do not meet the minimum support threshold.

• Repeat: Repeat the process until no new frequent itemsets can be found.

5. Example:

• Suppose you have a transactional dataset with the following transactions:

• T1: {A, B, C}

• T2: {A, B}

• T3: {A, C}

• T4: {B, C}

• T5: {B}

• With a minimum support threshold of 2, the initial frequent 1-itemsets are {A}, {B}, and {C}. The algorithm then iteratively generates and prunes candidate itemsets of larger sizes until no more frequent itemsets can be found (a short code sketch of this process on the same transactions follows this list).

6. Performance Optimization:

• To improve efficiency, the Apriori algorithm often uses techniques such as pruning
(eliminating candidates with infrequent subsets) and the hash tree structure for
counting support.
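
The following minimal Python sketch runs the level-wise procedure described above on the five example transactions with a minimum support count of 2. It follows the initialize / generate / count / prune loop but omits the subset-pruning and hash-tree optimizations for brevity:

# Example transactions from the text
transactions = [
    {"A", "B", "C"},   # T1
    {"A", "B"},        # T2
    {"A", "C"},        # T3
    {"B", "C"},        # T4
    {"B"},             # T5
]
MIN_SUPPORT = 2

def support_count(itemset):
    # Number of transactions containing every item of `itemset`
    return sum(itemset <= t for t in transactions)

# Initialization: frequent 1-itemsets
items = sorted(set().union(*transactions))
frequent = [{frozenset([i]) for i in items if support_count({i}) >= MIN_SUPPORT}]

# Iterate: join frequent (k-1)-itemsets, count, prune, repeat
k = 2
while frequent[-1]:
    candidates = {a | b for a in frequent[-1] for b in frequent[-1] if len(a | b) == k}
    frequent.append({c for c in candidates if support_count(c) >= MIN_SUPPORT})
    k += 1

for level in frequent[:-1]:
    print([set(s) for s in level])
# 1-itemsets: {A}, {B}, {C}; 2-itemsets: {A, B}, {A, C}, {B, C};
# no frequent 3-itemset, because {A, B, C} appears only in T1.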

Frequent itemset generation is a fundamental step in association rule mining, providing the basis for
discovering meaningful patterns and relationships in large datasets.

THE APRIORI PRINCIPLE


The Apriori principle is a fundamental concept in association rule mining and is a key foundation for the
Apriori algorithm. Proposed by Rakesh Agrawal and Ramakrishnan Srikant in 1994, the Apriori principle
helps reduce the search space when discovering frequent itemsets in a dataset.

The Apriori principle is based on the observation that if an itemset is frequent, then all of its subsets
must also be frequent. This principle is often expressed as:

If an itemset is infrequent, all its supersets will also be infrequent.

Conversely:

If an itemset is frequent, all its subsets will also be frequent.

This principle is crucial for efficiently identifying frequent itemsets in a large dataset without having to
examine all possible combinations. The Apriori algorithm uses the Apriori principle to generate
candidate itemsets and prune those that cannot be frequent based on the downward closure property.

Here's how the Apriori principle is applied in the context of the Apriori algorithm:

1. Generate Frequent 1-Itemsets:

• Initially, the algorithm identifies frequent 1-itemsets (individual items) by counting their
support in the dataset.
2. Generate Candidate Itemsets:

• For subsequent iterations, the algorithm generates candidate itemsets of size k based
on frequent itemsets of size k-1.

3. Prune Based on Apriori Principle:

• Before counting the support of candidate itemsets, the algorithm prunes candidates that have infrequent subsets. This pruning is justified by the Apriori principle: if any subset of a candidate is infrequent, the candidate itself, being a superset of it, must also be infrequent (a sketch of this join-and-prune step follows the list).

4. Count Support and Repeat:

• After pruning, the algorithm counts the support of the remaining candidate itemsets in
the dataset. Frequent itemsets are retained, and the process is repeated until no new
frequent itemsets can be found.
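
The join-and-prune step in point 3 can be written compactly. The sketch below generates size-k candidates from frequent (k-1)-itemsets and discards any candidate that has an infrequent (k-1)-subset, which is exactly the pruning the Apriori principle permits; the frequent 2-itemsets used as input are hypothetical:

from itertools import combinations

def apriori_gen(prev_frequent, k):
    # Join frequent (k-1)-itemsets, then prune candidates with infrequent subsets
    prev = set(prev_frequent)
    # Join step: union two (k-1)-itemsets that differ in exactly one item
    candidates = {a | b for a in prev for b in prev if len(a | b) == k}
    # Prune step (Apriori principle): every (k-1)-subset must itself be frequent
    return {c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))}

# Hypothetical frequent 2-itemsets
L2 = {frozenset(p) for p in [("I1", "I2"), ("I1", "I3"), ("I2", "I3"), ("I2", "I4")]}
print([set(c) for c in apriori_gen(L2, 3)])
# {I1, I2, I3} survives; {I1, I2, I4} and {I2, I3, I4} are pruned because
# {I1, I4} and {I3, I4} are not frequent.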

By leveraging the Apriori principle and the downward closure property, the Apriori algorithm efficiently
explores the search space of potential frequent itemsets, avoiding the need to examine all possible
combinations and reducing the computational cost of association rule mining.

APRIORI ALGORITHM
The Apriori algorithm was proposed by R. Agrawal and R. Srikant in 1994 for finding frequent itemsets in a dataset for Boolean association rules. The algorithm is named Apriori because it uses prior knowledge of frequent itemset properties. It applies an iterative, level-wise search in which frequent k-itemsets are used to find the (k+1)-itemsets.

To improve the efficiency of this level-wise generation of frequent itemsets, an important property called the Apriori property is used, which reduces the search space.

Apriori Property –
All non-empty subsets of a frequent itemset must be frequent. The key concept behind the Apriori algorithm is the anti-monotonicity of the support measure. Apriori assumes that:

All subsets of a frequent itemset must be frequent (the Apriori property).

If an itemset is infrequent, all its supersets will be infrequent.

Consider the following transaction dataset; we will find its frequent itemsets and generate association rules for them.
minimum support count is 2
minimum confidence is 60%
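
The transaction table itself is not shown in this copy of the notes. The support counts used in the steps below (for example sup(I1) = 6, sup(I2) = 7, and sup(I1, I2, I3) = 2) are consistent with the classic nine-transaction example dataset, which is assumed here purely so the steps can be followed; if the original table differs, read the counts from that table instead:

# Assumed transaction table (consistent with the support counts used below)
transactions = {
    "T100": {"I1", "I2", "I5"},
    "T200": {"I2", "I4"},
    "T300": {"I2", "I3"},
    "T400": {"I1", "I2", "I4"},
    "T500": {"I1", "I3"},
    "T600": {"I2", "I3"},
    "T700": {"I1", "I3"},
    "T800": {"I1", "I2", "I3", "I5"},
    "T900": {"I1", "I2", "I3"},
}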

Step-1: K=1
(I) Create a table containing the support count of each item present in the dataset, called C1 (the candidate set).

(II) Compare each candidate item's support count with the minimum support count (here min_support = 2); if the support count of a candidate item is less than min_support, remove that item. This gives us the itemset L1.

Step-2: K=2

• Generate candidate set C2 using L1 (this is called the join step). The condition for joining Lk-1 with Lk-1 is that the itemsets must have (K-2) elements in common.

• Check whether all subsets of each candidate itemset are frequent; if not, remove that itemset. (For example, the subsets of {I1, I2} are {I1} and {I2}, which are frequent. Check this for each itemset.)

• Now find the support count of these itemsets by searching the dataset.

(II) Compare the candidate (C2) support counts with the minimum support count (here min_support = 2); if the support count of a candidate itemset is less than min_support, remove that itemset. This gives us the itemset L2.

Step-3:

• Generate candidate set C3 using L2 (join step). The condition for joining Lk-1 with Lk-1 is that the itemsets must have (K-2) elements in common, so here, for L2, the first element should match.
The itemsets generated by joining L2 are {I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I4, I5}, and {I2, I3, I5}.

• Check whether all subsets of these itemsets are frequent and, if not, remove that itemset. (Here the subsets of {I1, I2, I3} are {I1, I2}, {I2, I3}, and {I1, I3}, which are all frequent. For {I2, I3, I4}, the subset {I3, I4} is not frequent, so remove it. Check every itemset similarly.)

• Find the support count of the remaining itemsets by searching the dataset.

(II) Compare the candidate (C3) support counts with the minimum support count (here min_support = 2); if the support count of a candidate itemset is less than min_support, remove that itemset. This gives us the itemset L3.

Step-4:

• Generate candidate set C4 using L3 (join step). The condition for joining Lk-1 with Lk-1 (K = 4) is that the itemsets must have (K-2) elements in common, so here, for L3, the first two elements (items) should match.

• Check whether all subsets of these itemsets are frequent. (Here the itemset formed by joining L3 is {I1, I2, I3, I5}, and one of its subsets, {I1, I3, I5}, is not frequent.) So there is no itemset in C4.

• We stop here because no further frequent itemsets are found.

Thus, we have discovered all the frequent itemsets. Now the generation of strong association rules comes into the picture. For that we need to calculate the confidence of each rule.

Confidence –
A confidence of 60% means that 60% of the customers who purchased milk and bread also bought butter.

Confidence(A->B) = Support_count(A∪B) / Support_count(A)

So here, taking one of the frequent itemsets as an example, we show the rule generation.
Itemset {I1, I2, I3} //from L3
So the rules can be:
[I1^I2]=>[I3] //confidence = sup(I1^I2^I3)/sup(I1^I2) = 2/4 × 100 = 50%
[I1^I3]=>[I2] //confidence = sup(I1^I2^I3)/sup(I1^I3) = 2/4 × 100 = 50%
[I2^I3]=>[I1] //confidence = sup(I1^I2^I3)/sup(I2^I3) = 2/4 × 100 = 50%
[I1]=>[I2^I3] //confidence = sup(I1^I2^I3)/sup(I1) = 2/6 × 100 = 33%
[I2]=>[I1^I3] //confidence = sup(I1^I2^I3)/sup(I2) = 2/7 × 100 = 28%
[I3]=>[I1^I2] //confidence = sup(I1^I2^I3)/sup(I3) = 2/6 × 100 = 33%

So if the minimum confidence were 50%, the first three rules would be considered strong association rules; with the 60% threshold stated earlier, none of these rules would qualify. The calculations above are reproduced in the short sketch below.
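
The confidence figures above can be reproduced with a few lines of Python. This is a minimal sketch that hard-codes the support counts quoted in the rules (it does not re-scan the dataset) and keeps the rules that meet a chosen minimum confidence:

from itertools import combinations

# Support counts quoted in the worked example above
support_count = {
    frozenset(["I1"]): 6, frozenset(["I2"]): 7, frozenset(["I3"]): 6,
    frozenset(["I1", "I2"]): 4, frozenset(["I1", "I3"]): 4, frozenset(["I2", "I3"]): 4,
    frozenset(["I1", "I2", "I3"]): 2,
}

def rules_from_itemset(itemset):
    # Enumerate every antecedent => consequent split of a frequent itemset
    itemset = frozenset(itemset)
    for r in range(1, len(itemset)):
        for antecedent in map(frozenset, combinations(sorted(itemset), r)):
            consequent = itemset - antecedent
            conf = support_count[itemset] / support_count[antecedent]
            yield set(antecedent), set(consequent), conf

MIN_CONF = 0.5
for a, c, conf in rules_from_itemset({"I1", "I2", "I3"}):
    if conf >= MIN_CONF:
        print(f"{sorted(a)} => {sorted(c)}: confidence = {conf:.0%}")
# Prints the three rules with two-item antecedents (50% each); the rules with
# single-item antecedents reach only 33%, 28%, and 33%.
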
Limitations of Apriori Algorithm
The Apriori algorithm can be slow. Its main limitation is the time and memory needed to hold a vast number of candidate sets when there are many frequent itemsets, a low minimum support threshold, or long itemsets; it is therefore not an efficient approach for very large datasets. For example, if there are 10^4 frequent 1-itemsets, the algorithm must generate roughly 10^4 × (10^4 − 1) / 2, i.e. more than 10^7, candidate 2-itemsets, each of which must then be counted against the database. Furthermore, to detect a frequent pattern of size 100, say {v1, v2, …, v100}, it would have to generate on the order of 2^100 candidate itemsets, making candidate generation prohibitively costly and time-consuming. The algorithm also scans the database repeatedly to count candidate itemsets, so it becomes very slow and inefficient when memory is limited and the number of transactions is large.

RULE GENERATION
Rule generation in the context of association rule mining, specifically using algorithms like Apriori,
involves deriving meaningful relationships or patterns from the discovered frequent itemsets. Once
frequent itemsets are identified, association rules are generated to express the associations between
items. These rules are in the form "IF {antecedent} THEN {consequent}" and provide insights into the co-
occurrence of items in the dataset.

Here's a step-by-step process for rule generation:

1. Frequent Itemset Discovery:

• Before generating rules, you need to identify frequent itemsets using an algorithm such
as Apriori. Frequent itemsets are sets of items that occur together frequently in the
dataset.

2. Rule Generation:

• For each frequent itemset, generate association rules by considering different combinations of items as antecedents and consequents. A rule must have at least one item in the antecedent and one item in the consequent.

3. Rule Evaluation:

• Evaluate the quality of each rule using metrics such as support, confidence, and lift.
These metrics help determine the significance and reliability of the discovered
associations.

• Support: The proportion of transactions in the dataset that contain both the
antecedent and the consequent.

• Confidence: The probability of finding the consequent in a transaction given that the antecedent is present.

• Lift: Measures the degree to which the antecedent and consequent are dependent, considering their individual occurrences.

4. Pruning Rules:

• Apply additional filtering criteria, such as setting a minimum confidence threshold, to retain only high-quality and interesting rules. Pruning helps focus on rules that are more likely to be meaningful or actionable.

5. Presentation and Interpretation:

• Present the generated rules in a human-readable format, making it easy for users to
understand the relationships between different items. This might involve sorting the
rules based on confidence or support.

6. Iterative Refinement:

• Depending on the specific goals of the analysis, you may need to iteratively refine the
rule generation process by adjusting parameters, such as support and confidence
thresholds, or by considering additional domain-specific constraints.

7. Rule Application:

• Once you have a set of high-quality rules, you can apply them to new data to make
predictions or gain insights. For example, in a retail setting, if you discover a rule like "IF
{bread} THEN {butter}," it suggests that customers who buy bread are likely to buy
butter as well.

It's important to note that association rules do not imply causation, and the interpretation of rules
should be done cautiously. Additionally, the effectiveness of rules depends on the quality of the data
and the appropriateness of the algorithm and parameters chosen for rule generation.

COMPACT REPRESENTATION OF FREQUENT ITEMSETS


Compact representation of frequent itemsets is a way to represent and store the discovered patterns in
a more efficient and concise manner. This is particularly important when dealing with large datasets, as
the raw enumeration of all frequent itemsets can be computationally expensive and memory-intensive.
There are several techniques for compactly representing frequent itemsets:

1. Closed Itemsets:

• Closed itemsets are a compact representation that eliminates redundant information. An itemset is closed if none of its supersets has the same support; therefore, for a closed itemset, adding any other item would result in a decrease in support. Closed itemsets capture all the essential information about frequent itemsets without redundancy (a small sketch distinguishing closed and maximal itemsets follows this list).

2. Maximal Itemsets:

• Maximal itemsets are another compact representation that retains only those itemsets
that are not subsets of any other frequent itemset. Unlike closed itemsets, maximal
itemsets do not consider support; they only focus on the structure of the itemsets.
Maximal itemsets provide a more compact representation by excluding subsets that do
not add new information.

3. Association Rules:

• Instead of storing all frequent itemsets separately, one can store only the high-
confidence association rules. These rules capture the essential relationships between
items in a more human-readable and actionable format. The compactness comes from
representing associations rather than individual itemsets.

4. Tree-Based Structures:

• Some methods use tree-based structures, such as FP-growth (Frequent Pattern growth),
to compactly represent frequent itemsets. FP-growth builds a compressed data
structure called the FP-tree, which facilitates efficient mining of frequent itemsets
without the need to explicitly generate and store all possible itemsets.

5. Bitwise Representations:

• In databases where items have unique identifiers, bitwise representations can be used
to compactly represent itemsets. Each item corresponds to a bit, and itemsets are
represented as bit vectors, making it computationally efficient for certain operations.

6. Vertical Data Format:

• In some cases, representing data in a vertical format where each item has its list of
transactions can lead to a more compact representation, especially when dealing with
sparse datasets.
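
A small sketch can make the first two representations concrete. Given a table of frequent itemsets and their support counts (the values below are invented for illustration), an itemset is closed if no proper superset has the same support, and maximal if no proper superset is frequent at all:

# Hypothetical frequent itemsets with support counts
frequent = {
    frozenset("A"): 5, frozenset("B"): 4, frozenset("C"): 3,
    frozenset("AB"): 4, frozenset("AC"): 3, frozenset("BC"): 2,
    frozenset("ABC"): 2,
}

def is_closed(itemset):
    # Closed: no proper superset has the same support
    return all(frequent[s] < frequent[itemset] for s in frequent if itemset < s)

def is_maximal(itemset):
    # Maximal: no proper superset is frequent at all
    return not any(itemset < s for s in frequent)

for s in frequent:
    print(sorted(s), frequent[s],
          "closed" if is_closed(s) else "", "maximal" if is_maximal(s) else "")
# {B}, {C}, and {B, C} are not closed because a superset has the same support;
# only {A, B, C} is maximal.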

The choice of compact representation depends on the specific requirements of the analysis, the
characteristics of the dataset, and the goals of the mining process. Each compact representation method
has its advantages and trade-offs in terms of storage efficiency, computational complexity, and ease of
interpretation.

THE FP-GROWTH (FREQUENT PATTERN GROWTH) ALGORITHM

The FP-Growth (Frequent Pattern Growth) algorithm is an efficient algorithm for mining frequent
itemsets from transactional databases. It was introduced by Jiawei Han, Jian Pei, and Yiwen Yin in their
paper "Mining Frequent Patterns without Candidate Generation" in 2000. FP-Growth is particularly well-
suited for large datasets and is an alternative to the Apriori algorithm.

Here's an overview of how the FP-Growth algorithm works:

1. Build the FP-Tree:

• Scan the transactional database and construct a data structure called the FP-Tree
(Frequent Pattern Tree).
• The FP-Tree is built by inserting each transaction into the tree. Items within a transaction are added as nodes, and the shared tree structure represents the relationships between different items (a minimal construction sketch follows this list).

2. Generate Conditional Pattern Bases:

• For each frequent item in the dataset, create a conditional pattern base by removing the
frequent item from the original transactions and keeping the remaining structure. This
step is performed recursively.

3. Construct Conditional FP-Trees:

• For each conditional pattern base, build a conditional FP-Tree. This is essentially a
smaller FP-Tree constructed from the conditional pattern base.

4. Mine Frequent Itemsets from Conditional FP-Trees:

• Recursively mine frequent itemsets from each conditional FP-Tree. This process involves
repeating the steps of building conditional pattern bases and constructing conditional
FP-Trees until no more frequent itemsets can be found.

5. Combine Frequent Itemsets:

• Combine the frequent itemsets obtained from the conditional FP-Trees with the
frequent itemsets from the original transactions to obtain the complete set of frequent
itemsets.
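
The construction in step 1 can be illustrated with a short, simplified sketch (not a full FP-Growth implementation): items are counted, infrequent items are dropped, each transaction is reordered by descending item frequency, and the ordered items are inserted along shared prefix paths, with a header table linking the nodes of each item. The transactions used here are made up:

from collections import Counter, defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}          # item -> FPNode

def build_fp_tree(transactions, min_support):
    # Pass 1: count item frequencies and keep only frequent items
    counts = Counter(item for t in transactions for item in t)
    frequent = {i for i, c in counts.items() if c >= min_support}

    root = FPNode(None, None)
    header = defaultdict(list)      # item -> list of nodes holding that item

    # Pass 2: insert each transaction with items ordered by descending frequency
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in frequent),
                           key=lambda i: (-counts[i], i)):
            if item not in node.children:
                node.children[item] = FPNode(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1
    return root, header

# Made-up transactions for illustration
transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c", "d"}]
root, header = build_fp_tree(transactions, min_support=2)
print({item: sum(n.count for n in nodes) for item, nodes in header.items()})
# Each item's node counts sum to its support; 'd' (support 1) was dropped.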

The key advantages of FP-Growth include:

• No Candidate Generation:

• Unlike the Apriori algorithm, FP-Growth does not generate candidate itemsets explicitly.
It constructs the FP-Tree directly from the dataset, avoiding the need to generate and
test multiple candidate itemsets.

• Efficiency:

• FP-Growth can be more efficient than traditional algorithms, especially when dealing
with large datasets, as it reduces the number of passes over the data and avoids the
generation of an explicit candidate set.

• Compact Data Structure:

• The FP-Tree is a compact data structure that captures the frequency information in a
condensed form, making it efficient for frequent pattern mining.

While FP-Growth is generally efficient, its performance depends on the characteristics of the dataset. It is well suited to datasets with a large number of transactions and a relatively small number of unique items; a short library-based usage sketch is given below.
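
In practice, FP-Growth is rarely implemented by hand. As one hedged example, the mlxtend Python library provides an fpgrowth function alongside its apriori function; assuming mlxtend and pandas are installed and the transactions below (which are made up) are used, a minimal usage sketch looks like this:

# Minimal sketch using the mlxtend library (pip install mlxtend pandas)
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth, association_rules

# Made-up transactions for illustration
dataset = [["milk", "bread", "butter"],
           ["bread", "butter"],
           ["milk", "bread"],
           ["milk", "butter"],
           ["bread"]]

# One-hot encode the transactions into a boolean DataFrame
te = TransactionEncoder()
df = pd.DataFrame(te.fit(dataset).transform(dataset), columns=te.columns_)

# Mine frequent itemsets (note: min_support is a fraction here, not a raw count)
frequent_itemsets = fpgrowth(df, min_support=0.4, use_colnames=True)

# Derive association rules that meet a minimum confidence of 60%
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
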
FP-GROWTH ALGORITHM
The two primary drawbacks of the Apriori Algorithm are:

1. At each step, candidate sets have to be built.

2. To build the candidate sets, the algorithm has to repeatedly scan the database.

These two properties inevitably make the algorithm slower. To overcome these redundant steps, a new association-rule mining algorithm was developed, named the Frequent Pattern Growth algorithm. It overcomes the disadvantages of the Apriori algorithm by storing all the transactions in a trie-like data structure (the FP-Tree). Consider the following data:

The above data is a hypothetical dataset of transactions in which each letter represents an item. The frequency of each individual item is computed:

Let the minimum support be 3. A Frequent Pattern set is built which will contain all the elements whose
frequency is greater than or equal to the minimum support. These elements are stored in descending
order of their respective frequencies. After insertion of the relevant items, the set L looks like this:-

L = { K : 5, E : 4, M : 3, O : 3, Y : 3 }

Now, for each transaction, the respective Ordered-Item set is built. It is done by iterating the Frequent
Pattern set and checking if the current item is contained in the transaction in question. If the current
item is contained, the item is inserted in the Ordered-Item set for the current transaction. The following
table is built for all the transactions:

Now, all the Ordered-Item sets are inserted into a Trie data structure (the FP-Tree); a minimal insertion sketch is given after steps a) to e) below.

a) Inserting the set {K, E, M, O, Y}:

Here, all the items are simply linked one after the other in their order of occurrence in the set, and the support count of each item is initialized to 1.

b) Inserting the set {K, E, O, Y}:

Up to the insertion of the elements K and E, the support counts along the existing path are simply increased by 1. On inserting O, we can see that there is no direct link between E and O, so a new node for item O is initialized with a support count of 1 and item E is linked to this new node. On inserting Y, we first initialize a new node for item Y with a support count of 1 and link the new node of O to the new node of Y.
c) Inserting the set {K, E, M}:

Here simply the support count of each element is increased by 1.

d) Inserting the set {K, M, Y}:

Similar to step b), first the support count of K is increased, then new nodes for M and Y are initialized
and linked accordingly.
e) Inserting the set {K, E, O}:

Here simply the support counts of the respective elements are increased. Note that the support count of
the new node of item O is increased.
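
The five insertions above can be reproduced with a small tree sketch. The following minimal Python illustration (a simplified stand-in, not a full FP-Growth implementation) inserts the Ordered-Item sets listed in steps a) to e) and prints each node with its final support count:

class Node:
    def __init__(self, item):
        self.item, self.count, self.children = item, 0, {}

    def insert(self, ordered_items):
        # Walk/extend the path for one Ordered-Item set, incrementing counts
        if not ordered_items:
            return
        head, rest = ordered_items[0], ordered_items[1:]
        child = self.children.setdefault(head, Node(head))
        child.count += 1
        child.insert(rest)

    def show(self, depth=0):
        if self.item is not None:
            print("  " * depth + f"{self.item}: {self.count}")
        for child in self.children.values():
            child.show(depth + 1)

root = Node(None)
for ordered in [["K", "E", "M", "O", "Y"],   # a)
                ["K", "E", "O", "Y"],        # b)
                ["K", "E", "M"],             # c)
                ["K", "M", "Y"],             # d)
                ["K", "E", "O"]]:            # e)
    root.insert(ordered)
root.show()
# K: 5, with branches E: 4 (M: 2 -> O: 1 -> Y: 1, and O: 2 -> Y: 1)
# and M: 1 -> Y: 1, matching the support counts described in steps a) to e).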

Now, for each item, the Conditional Pattern Base is computed: the path labels of all the paths in the frequent-pattern tree that lead to any node of the given item. Note that the items in the table below are arranged in ascending order of their frequencies.
Next, for each item, the Conditional Frequent Pattern Tree is built. It is done by taking the set of elements that is common to all the paths in that item's Conditional Pattern Base and calculating its support count by summing the support counts of all those paths.

From the Conditional Frequent Pattern Tree, the frequent pattern rules are generated by pairing the items of the Conditional Frequent Pattern Tree set with the corresponding item, as given in the table below.
For each row, two types of association rules can be inferred; for example, for the first row, which contains the elements K and Y, the rules K -> Y and Y -> K can be inferred. To determine the valid rules, the confidence of both rules is calculated, and the ones with confidence greater than or equal to the minimum confidence value are retained.
