Apriori Algorithm
Data mining is the process of extracting knowledge, patterns, and insights from
large volumes of data. It involves using various computational techniques and
applying advanced algorithms to discover hidden relationships, patterns, and
trends within massive datasets that can help organizations make informed
decisions.
How does the Apriori algorithm work?
• The algorithm first counts the support of each candidate 1-itemset and keeps those that meet the minimum support threshold.
• It then grows the itemsets by joining the frequent itemsets with one another to form candidate 2-itemsets.
• It computes the support of each 2-itemset and prunes away those that do not meet the minimum support threshold.
• This grow-and-prune process is repeated until no new itemsets meet the minimum support threshold.
• Optionally, a maximum itemset size or number of iterations can be specified to limit the size and runtime of the
algorithm.
• The output of the Apriori algorithm is a collection of all the frequent k-itemsets.
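The grow-and-prune loop above can be sketched in a few lines of Python. This is a minimal illustration, not an optimized implementation; it assumes each transaction is a set of item labels, and the function and variable names are chosen here for clarity:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return every frequent itemset (as a frozenset) mapped to its support."""
    n = len(transactions)
    # Start with all candidate 1-itemsets.
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items}
    frequent = {}
    k = 1
    while current:
        # Support = fraction of transactions that contain the itemset.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        survivors = {c: cnt / n for c, cnt in counts.items()
                     if cnt / n >= min_support}
        frequent.update(survivors)
        # Grow: join surviving k-itemsets to form candidate (k+1)-itemsets.
        k += 1
        current = {a | b for a in survivors for b in survivors
                   if len(a | b) == k}
    return frequent

txns = [{"A","B","C"}, {"A","C"}, {"B","C"}, {"A","D"}, {"A","C","D"}]
print(apriori(txns, 0.5))  # frequent itemsets: {A}, {C}, and {A, C}
```

Note how the join step never produces a candidate containing an infrequent subset's items on their own: pruning at level k shrinks the pool from which level k+1 candidates are built.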
• Leverage: like lift, it compares the observed co-occurrence of X and Y with the
expected co-occurrence if they were statistically independent. It measures the difference
between the observed joint probability of X and Y and the product of their individual
probabilities. A leverage value greater than 0 indicates a non-random relationship between X and Y.
• By considering lift and leverage along with confidence, it becomes possible to identify
interesting and meaningful rules while filtering out coincidental associations. These
measures help ensure that the discovered rules are not only trustworthy but also
statistically significant.
Advantages and disadvantages of
the Apriori algorithm
• Methods to improve the efficiency of the Apriori algorithm:
1. Partitioning: Any itemset that is frequent in the transaction database must be frequent in at least one of the
partitions of the database. By mining each partition separately and combining the results, the algorithm reduces the
search space.
2. Sampling: This approach involves extracting a subset of the data with a lower support threshold and
performing association rule mining on the subset. This reduces the computational overhead.
3. Transaction reduction: Transactions that do not contain frequent k-itemsets are irrelevant for subsequent
scans and can be ignored. This reduces the number of transactions that need to be processed.
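Transaction reduction is the easiest of the three to show in code. The sketch below (a hypothetical helper, not part of any library) drops every transaction that contains no frequent k-itemset, since such a transaction cannot contribute to any frequent (k+1)-itemset:

```python
def reduce_transactions(transactions, frequent_k_itemsets):
    # A transaction containing no frequent k-itemset cannot contain a
    # frequent (k+1)-itemset either, so it can be skipped in later scans.
    return [t for t in transactions
            if any(s <= t for s in frequent_k_itemsets)]

txns = [{"A","B","C"}, {"A","C"}, {"B","C"}, {"A","D"}, {"A","C","D"}]
freq2 = [frozenset({"A", "C"})]  # the only frequent 2-itemset at min support 0.5
print(reduce_transactions(txns, freq2))  # keeps only T1, T2, and T5
```

On this toy dataset the scan for 3-itemsets now touches three transactions instead of five; on a real database the savings compound across iterations.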
Conclusion:
• Association rules are unsupervised analysis techniques used to uncover relationships among items in datasets.
They have various applications, including market basket analysis, clickstream analysis, and recommendation
engines. While association rules don't predict outcomes, they excel at identifying interesting and non-obvious
relationships that provide valuable insights for improving business operations.
• The Apriori algorithm is a fundamental algorithm for generating association rules. This chapter demonstrated
the steps of the Apriori algorithm using a grocery store example to generate frequent itemsets and useful rules.
Measures such as support, confidence, lift, and leverage were discussed to evaluate the rules and distinguish
interesting relationships from coincidental ones. The chapter also outlined the advantages and disadvantages of
the Apriori algorithm and suggested methods to enhance its efficiency.
Exercises from the book
• What is the Apriori property? The Apriori property is a principle in association rule mining stating that if an itemset is
frequent, then all of its subsets must also be frequent. Equivalently, if an itemset is infrequent, then every superset of it
must also be infrequent. This property is used to reduce the search space and improve the efficiency of mining
association rules by pruning any candidate itemset that contains an infrequent subset. The Apriori algorithm
leverages this property to incrementally build larger itemsets from frequent smaller itemsets.
• Following is a list of five transactions that include items A, B, C, and D:
T1 : { A,B,C } T2 : { A,C } T3 : { B,C } T4 : { A,D } T5 : { A,C,D }
Which itemsets satisfy the minimum support of 0.5? (Hint: An itemset may include more than one item.)
Let's calculate the support for each itemset:
Itemset {A}: Appears in transactions T1, T2, T4, T5. Support = 4/5 = 0.8 (80%)
Itemset {B}: Appears in transactions T1, T3. Support = 2/5 = 0.4 (40%)
Itemset {C}: Appears in transactions T1, T2, T3, T5. Support = 4/5 = 0.8 (80%)
Itemset {D}: Appears in transactions T4, T5. Support = 2/5 = 0.4 (40%)
Itemset {A, B}: Appears in transaction T1. Support = 1/5 = 0.2 (20%)
Itemset {A, C}: Appears in transactions T1, T2, T5. Support = 3/5 = 0.6 (60%)
Itemset {A, D}: Appears in transactions T4, T5. Support = 2/5 = 0.4 (40%)
Itemset {B, C}: Appears in transactions T1, T3. Support = 2/5 = 0.4 (40%)
Itemset {C, D}: Appears in transaction T5. Support = 1/5 = 0.2 (20%)
Itemset {A, B, C}: Appears in transaction T1. Support = 1/5 = 0.2 (20%)
Itemset {A, C, D}: Appears in transaction T5. Support = 1/5 = 0.2 (20%)
Itemsets that satisfy the minimum support of 0.5 are: {A} {C} {A, C}
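The hand calculation above can be checked by brute force: enumerate every non-empty itemset over {A, B, C, D} and compute its support directly. This sketch uses only the standard library:

```python
from itertools import combinations

txns = [{"A","B","C"}, {"A","C"}, {"B","C"}, {"A","D"}, {"A","C","D"}]
items = sorted({i for t in txns for i in t})

# Support of an itemset = fraction of transactions that contain it.
supports = {}
for k in range(1, len(items) + 1):
    for combo in combinations(items, k):
        s = frozenset(combo)
        supports[s] = sum(1 for t in txns if s <= t) / len(txns)

frequent = sorted(sorted(s) for s, v in supports.items() if v >= 0.5)
print(frequent)  # [['A'], ['A', 'C'], ['C']]
```

This confirms the answer: only {A}, {C}, and {A, C} reach support 0.5. Brute-force enumeration is fine for four items (15 itemsets) but grows as 2^n, which is exactly the blow-up the Apriori property exists to avoid.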
• How are interesting rules identified? How are interesting rules distinguished from coincidental rules?
Interesting rules are identified based on their significance and relevance to the analysis objectives. Several
measures are commonly used to evaluate the interestingness of association rules, including support, confidence, lift,
and leverage.
Support measures the frequency or occurrence of an itemset in the dataset. Rules with higher support values indicate
that the itemset occurs frequently and are more likely to be interesting.
Confidence measures the reliability or trustworthiness of a rule. It represents the conditional probability that the
consequent of the rule holds true given the antecedent. High-confidence rules suggest a strong association between
the antecedent and consequent.
Lift compares the observed frequency of the rule's antecedent and consequent occurring together to the expected
frequency under independence. A lift value greater than 1 indicates a meaningful relationship between the items and
suggests an interesting rule.
Leverage measures the difference between the observed frequency of the rule and the expected frequency if the
items were independent. A positive leverage value indicates a non-random relationship and suggests an interesting
rule.
To distinguish interesting rules from coincidental rules, a combination of these measures is employed. Rules with high
support, confidence, lift, and leverage are considered more likely to be meaningful and interesting. Additionally,
domain knowledge and human insights play a crucial role in evaluating the relevance and significance of rules. By
considering these measures and incorporating expert knowledge, analysts can filter out coincidental rules and focus
on those that provide valuable insights and actionable information.
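As a concrete instance of these measures working together, consider the rule A → C from the five-transaction exercise above. The helper below is illustrative; confidence(X→Y) = supp(X∪Y) / supp(X):

```python
def confidence(supp_xy, supp_x):
    # Conditional probability P(Y | X): how often the consequent holds
    # in transactions where the antecedent holds.
    return supp_xy / supp_x

# Rule A -> C: supp(A) = 0.8, supp(A and C) = 0.6, supp(C) = 0.8.
print(confidence(0.6, 0.8))  # 0.75
```

The rule looks reliable on confidence alone (0.75), yet its lift is 0.6 / (0.8 × 0.8) ≈ 0.94, below 1. High confidence with lift near or below 1 is the signature of a coincidental rule driven by a popular consequent, which is why confidence should never be read in isolation.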
Thanks