Apriori Algorithm

Association Rule Mining (ARM) is a technique for discovering relationships between items in datasets, utilizing algorithms like Apriori, FP Growth, and ECLAT. The Apriori algorithm identifies frequent itemsets based on support, confidence, and lift, while FP Growth improves efficiency by avoiding candidate generation. ECLAT employs a depth-first search approach to find frequent items using transaction ID sets, offering advantages in memory usage and speed over traditional methods.


Association Rule Mining
Gayathri Prasad S
Association Rule Mining

• Association rule mining is a technique to identify underlying relations between different items.
• In ARM, the frequency of patterns and associations in the dataset is identified as item sets, which are then used to predict the next relevant item in the set.
• Different statistical algorithms have been developed to implement association rule mining; Apriori, FP Growth, and ECLAT are among them.
Types of Association Rule Mining Algorithms
• Existing association rule mining algorithms can be broadly divided into two main categories: horizontal format mining algorithms and vertical format mining algorithms. A transaction-item matrix can be represented in either a horizontal or a vertical way.
• The most commonly used layout is the horizontal data layout: each transaction has a transaction identifier (TID) and a list of items occurring in that transaction, i.e., {TID:itemset}. Another commonly used layout is the vertical data layout, in which the database consists of a set of items, each followed by the set of transaction identifiers containing the item, i.e., {item:TID_set}.
• The Apriori algorithm uses the horizontal format, while ECLAT can be used only on vertical format datasets.
Horizontal vs Vertical Data Format
Apriori Algorithm

It searches for a series of frequent sets of items in the dataset.
It builds on associations and correlations between the itemsets.
There are three major components of the Apriori algorithm:
• Support
• Confidence
• Lift
Support

• Support refers to the default popularity of an item and can be calculated by dividing the number of transactions containing a particular item by the total number of transactions.
• Suppose we want to find the support for item B. This can be calculated as:
Support(B) = (Transactions containing B)/(Total Transactions)
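The formula above can be sketched directly in code; the toy transaction list below is illustrative, not from the slides.

```python
# Toy transaction list (illustrative; not from the slides).
transactions = [
    {"A", "B"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B", "C"},
    {"B"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

# B appears in 4 of the 5 transactions.
print(support({"B"}, transactions))  # 0.8
```

The same function handles multi-item sets, e.g. support({"A", "B"}) counts transactions containing both items.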
Confidence

• Confidence refers to the likelihood that item B is also bought if item A is bought. It can be calculated by dividing the number of transactions where A and B are bought together by the total number of transactions where A is bought. Mathematically, it can be represented as:
Confidence(A→B) = (Transactions containing both A and B)/(Transactions containing A)
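As a quick sketch of this ratio (on an illustrative toy transaction list, not data from the slides):

```python
# Toy transaction list (illustrative; not from the slides).
transactions = [
    {"A", "B"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B", "C"},
    {"B"},
]

def confidence(a, b, transactions):
    """Confidence(A -> B): transactions with both A and B / transactions with A."""
    a, b = set(a), set(b)
    both = sum(1 for t in transactions if (a | b) <= t)
    with_a = sum(1 for t in transactions if a <= t)
    return both / with_a

# A occurs in 3 transactions; A and B occur together in 2, so 2/3.
print(confidence({"A"}, {"B"}, transactions))
```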
Lift

• Lift(A→B) refers to the increase in the ratio of the sale of B when A is sold. Lift(A→B) can be calculated by dividing Confidence(A→B) by Support(B). Mathematically, it can be represented as:
• Lift(A→B) = (Confidence(A→B))/(Support(B))
• A Lift of 1 means there is no association between products A and
B. Lift of greater than 1 means products A and B are more likely
to be bought together. Finally, Lift of less than 1 refers to the case
where two products are unlikely to be bought together.
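Putting the three measures together on an illustrative toy dataset (not data from the slides):

```python
# Toy transaction list (illustrative; not from the slides).
transactions = [
    {"A", "B"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B", "C"},
    {"B"},
]

def support(itemset, transactions):
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(a, b, transactions):
    return support(set(a) | set(b), transactions) / support(a, transactions)

def lift(a, b, transactions):
    """Lift(A -> B) = Confidence(A -> B) / Support(B)."""
    return confidence(a, b, transactions) / support(b, transactions)

# Confidence(A -> B) = 2/3 and Support(B) = 0.8, so lift is 5/6 (< 1):
# on this toy data, A and B co-occur slightly less than expected by chance.
print(lift({"A"}, {"B"}, transactions))
```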
Set Conditions

• For large datasets, there can be hundreds of items across hundreds of thousands of transactions. The Apriori algorithm tries to extract rules for each possible combination of items. This process can be extremely slow due to the number of combinations. To speed up the process, we need to perform the following steps:
• Set a minimum value for support and confidence. This means that we are only interested
in finding rules for the items that have certain default existence (e.g. support) and have a
minimum value for co-occurrence with other items (e.g. confidence).
• Extract all the subsets having a higher support value than the minimum threshold.
• Select all the rules from the subsets with a confidence value higher than the minimum threshold.
• Order the rules by descending order of lift.
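The steps above can be sketched as a small pipeline; the transactions and thresholds below are illustrative, not from the slides, and the brute-force enumeration is only meant to show the filtering logic, not an efficient Apriori implementation.

```python
from itertools import combinations

# Toy data and thresholds (illustrative; not from the slides).
transactions = [{"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C"}, {"B"}]
min_support, min_confidence = 0.4, 0.6

def support(items):
    items = set(items)
    return sum(1 for t in transactions if items <= t) / len(transactions)

items = sorted({i for t in transactions for i in t})

# Keep only itemsets that meet the minimum support.
frequent = [frozenset(c)
            for k in range(1, len(items) + 1)
            for c in combinations(items, k)
            if support(c) >= min_support]

# Keep only rules that meet the minimum confidence.
rules = []
for itemset in (s for s in frequent if len(s) >= 2):
    for r in range(1, len(itemset)):
        for antecedent in combinations(sorted(itemset), r):
            a = frozenset(antecedent)
            b = itemset - a
            conf = support(itemset) / support(a)
            if conf >= min_confidence:
                rules.append((tuple(sorted(a)), tuple(sorted(b)),
                              conf / support(b)))  # lift

# Order the rules by descending lift.
rules.sort(key=lambda rule: rule[2], reverse=True)
```

On this toy data the pipeline keeps four rules, with A→C and C→A (lift 10/9) ranked above A→B and C→B (lift 5/6).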
Frequent Item Set

• An itemset whose support is greater than or equal to a minSup threshold.
• Frequent itemsets, also known as frequent patterns, are simply all the itemsets that satisfy the minimum support threshold.
• The key property of Apriori before building the algorithm is:
• All subsets of a frequent itemset must be frequent.
• If an itemset is infrequent, all its supersets will be infrequent.
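The two properties above can be checked numerically on a toy transaction list (illustrative, not from the slides):

```python
# A superset can only occur in transactions where all its subsets occur,
# so its support count can never exceed any subset's count.
transactions = [{"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C"}, {"B"}]

def count(items):
    items = set(items)
    return sum(1 for t in transactions if items <= t)

assert count({"A", "B"}) <= count({"A"})  # 2 <= 3
assert count({"A", "B"}) <= count({"B"})  # 2 <= 4
# Contrapositive: if {A} were infrequent, {A, B} could not be frequent,
# which is what lets Apriori prune supersets of infrequent itemsets.
```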
Apriori Example
Advantages of Apriori algorithm

• Easy to implement
• Uses the large itemset property
Shortcomings
There are two major shortcomings of the Apriori algorithm:
• The number of itemsets from candidate generation can be extremely large. In general, a dataset that contains k items can potentially generate up to 2^k itemsets. Because k can be very large in many practical applications, this becomes computationally expensive.
• A lot of time is wasted on counting the support, since we have to scan the itemset database over and over again.
FP Growth Algorithm

• The FP Growth (Frequent Pattern growth) algorithm is an improvement over the Apriori algorithm. FP Growth is used for finding frequent itemsets in a transaction database without candidate generation.
• FP Growth represents frequent items in a frequent pattern tree, or FP-tree. The purpose of the FP-tree is to mine the most frequent patterns. Each node of the FP-tree represents an item of an itemset.
• The root node represents null, while the lower nodes represent the itemsets. The associations of the nodes with the lower nodes, that is, of the itemsets with the other itemsets, are maintained while forming the tree.
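The tree structure described above can be sketched with a minimal node class; the `FPNode` name and the two sample insertions are illustrative, not from the slides.

```python
# Minimal sketch of an FP-tree: a null root whose children carry
# item labels and counts; shared prefixes are stored only once.
class FPNode:
    def __init__(self, item, parent=None):
        self.item = item          # None for the root
        self.count = 0
        self.parent = parent
        self.children = {}        # item -> FPNode

def insert(root, ordered_items):
    """Insert one transaction (items already sorted by frequency)."""
    node = root
    for item in ordered_items:
        if item not in node.children:
            node.children[item] = FPNode(item, parent=node)
        node = node.children[item]
        node.count += 1

root = FPNode(None)
insert(root, ["K", "E", "M"])
insert(root, ["K", "E", "O"])
# The shared prefix K -> E is stored once, with count 2 on each node;
# M and O branch off below E with count 1 each.
```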
Advantages of FP growth algorithm

• Faster than the Apriori algorithm
• No candidate generation
• Only two passes over the dataset
Disadvantages of FP growth algorithm

• The FP-tree may not fit in memory
• The FP-tree is expensive to build
FP Growth Example
• Let the minimum support be 3. A Frequent Pattern set is built
which will contain all the elements whose frequency is greater
than or equal to the minimum support, in descending order of
their respective frequencies:
L = {K : 5, E : 4, M : 3, O : 3, Y : 3}
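The first pass that produces this list can be sketched as follows. The raw transactions below are an assumption chosen so that the counts reproduce the slide's L; the slide itself only shows the final list.

```python
from collections import Counter

# Assumed toy transactions whose item counts match the slide's
# L = {K: 5, E: 4, M: 3, O: 3, Y: 3}.
transactions = [
    {"E", "K", "M", "N", "O", "Y"},
    {"D", "E", "K", "N", "O", "Y"},
    {"A", "E", "K", "M"},
    {"C", "K", "M", "U", "Y"},
    {"C", "E", "I", "K", "O"},
]
min_support = 3

# Count each item once per transaction, keep items meeting the minimum
# support, and order by descending count (ties broken alphabetically).
counts = Counter(item for t in transactions for item in t)
L = {item: c
     for item, c in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
     if c >= min_support}
print(L)  # {'K': 5, 'E': 4, 'M': 3, 'O': 3, 'Y': 3}
```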
ECLAT Algorithm

• ECLAT stands for Equivalence Class Transformation.
• The ECLAT algorithm is a data mining algorithm used to find frequent itemsets.
• ECLAT cannot use a horizontal database. If the database is horizontal, it must first be converted into a vertical database.
• The main idea is to use intersections of TID sets to compute candidate support values and avoid generating subsets that do not exist in the prefix tree. When the function is called for the first time, all of the individual items are used together with their TID sets. The function is then called recursively, and in each recursive call, each item-TID-set pair is checked and combined with other item-TID-set pairs. This process continues until no more candidate item-TID-set pairs can be merged.
Workflow

Step 1 — List the Transaction ID (TID) set of each product
• The first step is to make a list that contains, for each product, a list of the transaction IDs in which the product occurs. These transaction ID lists are called Transaction ID sets, or TID sets.
Step 2 — Filter with minimum support
• The next step is to decide on a value called the minimum support.
The minimum support will serve to filter out products that do not
occur often enough to be considered.
Step 3 — Compute the Transaction ID set of each product pair
• We now move on to pairs of products. We basically repeat the same thing as in step 1, but now for product pairs. The interesting thing about the ECLAT algorithm is that this step is done using the intersection of the two original sets. This makes it different from the Apriori algorithm. The ECLAT algorithm is faster because it is much simpler to identify the intersection of the sets of transaction IDs than to scan each individual transaction for the presence of pairs of products (as Apriori does).
Step 4 — Filter out the pairs that do not reach minimum support
• As before, we need to filter out results that do not reach the minimum support.
Step 5 — Continue as long as you can make new pairs above the minimum support
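Steps 1 through 4 above can be sketched in a few lines; the TID sets mirror the bread-and-butter example in this deck, with plain integer TIDs for brevity.

```python
from itertools import combinations

# Step 1: vertical {item: TID set} layout.
tidsets = {
    "Bread":  {1, 4, 5, 7, 8, 9},
    "Butter": {1, 2, 3, 4, 6, 8, 9},
    "Milk":   {3, 5, 6, 7, 8, 9},
    "Coke":   {2, 4},
    "Jam":    {1, 8},
}
# Step 2: minimum support threshold.
min_support = 2

# Steps 3-4: intersect TID sets pairwise (no rescanning of
# transactions) and keep pairs that reach the minimum support.
pairs = {}
for (a, ta), (b, tb) in combinations(sorted(tidsets.items()), 2):
    common = ta & tb
    if len(common) >= min_support:
        pairs[frozenset({a, b})] = common

# {Bread, Butter} occurs in transactions 1, 4, 8 and 9.
```

Step 5 would repeat the same intersection on the surviving pairs to build triples, and so on.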
Example

k = 1, minimum support = 2

Bread: {T1, T4, T5, T7, T8, T9}
Butter: {T1, T2, T3, T4, T6, T8, T9}
Milk: {T3, T5, T6, T7, T8, T9}
Coke: {T2, T4}
Jam: {T1, T8}

k = 2

{Bread, Butter}: {T1, T4, T8, T9}
{Bread, Milk}: {T5, T7, T8, T9}
{Bread, Coke}: {T4}
{Bread, Jam}: {T1, T8}
{Butter, Milk}: {T3, T6, T8, T9}
{Butter, Coke}: {T2, T4}
{Butter, Jam}: {T1, T8}
{Milk, Jam}: {T8}

k = 3

{Bread, Butter, Milk}: {T8, T9}
{Bread, Butter, Jam}: {T1, T8}

k = 4

{Bread, Butter, Milk, Jam}: {T8}

We stop at k = 4 because there are no more item-TID-set pairs to combine. Since the minimum support is 2, we conclude the rules for this dataset from the itemsets above that meet it.
Features of Eclat

Advantages
• Since the ECLAT algorithm uses a depth-first search approach, it consumes less memory than the Apriori algorithm
• The ECLAT algorithm does not involve repeated scanning of the data to calculate the individual support values
• The ECLAT algorithm scans the currently generated dataset, unlike Apriori, which scans the original dataset
Disadvantage
• If the TID lists are too large, the ECLAT algorithm may run out of memory.
Thank You
