06 Apriori
— Chapter 6 —
Basic Concepts
Evaluation Methods
Summary
What Is Frequent Pattern Analysis?
Frequent pattern: a pattern (a set of items, subsequences, substructures,
etc.) that occurs frequently in a data set
First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context
of frequent itemsets and association rule mining
Motivation: Finding inherent regularities in data
What products were often purchased together? Beer and diapers?!
What are the subsequent purchases after buying a PC?
What kinds of DNA are sensitive to this new drug?
Can we automatically classify web documents?
Applications
Basket data analysis, cross-marketing, catalog design, sale campaign
analysis, Web log (click stream) analysis, and DNA sequence analysis.
Why Is Freq. Pattern Mining Important?
Freq. pattern: An intrinsic and important property of
datasets
Foundation for many essential data mining tasks
Association, correlation, and causality analysis
Basic Concepts: Frequent Patterns
Basic Concepts: Association Rules
Tid   Items bought
10    Beer, Nuts, Diaper
20    Beer, Coffee, Diaper
30    Beer, Diaper, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Diaper, Eggs, Milk
[Figure: Venn diagram of customers who buy beer, customers who buy diapers, and customers who buy both]
Find all the rules X → Y with minimum support and confidence
support, s: probability that a transaction contains X ∪ Y
confidence, c: conditional probability that a transaction having X also contains Y
Let minsup = 50%, minconf = 50%
Freq. Pat.: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3
Association rules: (many more!)
Beer → Diaper (60%, 100%)
Diaper → Beer (60%, 75%)
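The support and confidence figures on this slide can be checked with a few lines of Python. This is only a sketch: the helper names `support_count`, `support`, and `confidence` are illustrative and do not come from the slides.

```python
# Transactions from the slide, keyed by Tid.
transactions = {
    10: {"Beer", "Nuts", "Diaper"},
    20: {"Beer", "Coffee", "Diaper"},
    30: {"Beer", "Diaper", "Eggs"},
    40: {"Nuts", "Eggs", "Milk"},
    50: {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
}

def support_count(itemset, db):
    """Number of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= items for items in db.values())

def support(itemset, db):
    """Fraction of transactions that contain `itemset`."""
    return support_count(itemset, db) / len(db)

def confidence(lhs, rhs, db):
    """Estimate of P(rhs | lhs): count(lhs and rhs together) / count(lhs)."""
    return support_count(set(lhs) | set(rhs), db) / support_count(lhs, db)
```

With this data, `support({"Beer", "Diaper"}, transactions)` gives 0.6, `confidence({"Beer"}, {"Diaper"}, transactions)` gives 1.0, and `confidence({"Diaper"}, {"Beer"}, transactions)` gives 0.75, matching the two rules listed above.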
Closed Patterns and Max-Patterns
A long pattern contains a combinatorial number of sub-patterns, e.g.,
{a1, …, a100} contains $\binom{100}{1} + \binom{100}{2} + \dots + \binom{100}{100}
= 2^{100} - 1 \approx 1.27 \times 10^{30}$ sub-patterns!
Solution: Mine closed patterns and max-patterns instead
An itemset X is closed if X is frequent and there exists no
super-pattern Y ⊃ X with the same support as X
(proposed by Pasquier, et al. @ ICDT’99)
An itemset X is a max-pattern if X is frequent and there
exists no frequent super-pattern Y ⊃ X (proposed by
Bayardo @ SIGMOD’98)
Closed pattern is a lossless compression of freq. patterns
Reducing the # of patterns and rules
March 15, 2024 Data Mining: Concepts and Techniques 8
Closed Patterns and Max-Patterns
Exercise: Suppose a DB contains only two transactions
<a1, …, a100>, <a1, …, a50>
Let min_sup = 1
What is the set of closed itemsets?
{a1, …, a100}: 1
{a1, …, a50}: 2
What is the set of max-patterns?
{a1, …, a100}: 1
What is the set of all patterns?
{a1}: 2, …, {a1, a2}: 2, …, {a1, a51}: 1, …, {a1, a2, …, a100}: 1
A big number: $2^{100} - 1$? Why?
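A scaled-down version of this exercise can be checked by brute force. The sketch below assumes a database of two transactions, <a1, …, a4> and <a1, a2>, standing in for the 100-item and 50-item ones; the function names are illustrative. Enumerating every subset is only feasible for tiny item sets, which is exactly the point of the exercise.

```python
from itertools import combinations

def frequent_itemsets(db, min_sup):
    """Brute force: count every non-empty subset of the item universe."""
    items = sorted(set().union(*db))
    result = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            s = frozenset(combo)
            count = sum(s <= t for t in db)
            if count >= min_sup:
                result[s] = count
    return result

def closed_and_max(freq):
    """Split frequent itemsets into closed patterns and max-patterns."""
    # Closed: no proper superset has the same support.
    closed = {s: c for s, c in freq.items()
              if not any(s < t and freq[t] == c for t in freq)}
    # Max: no proper superset is frequent at all.
    maximal = {s: c for s, c in freq.items()
               if not any(s < t for t in freq)}
    return closed, maximal

# Scaled-down stand-in for <a1, ..., a100>, <a1, ..., a50> (an assumption).
db = [frozenset({"a1", "a2", "a3", "a4"}), frozenset({"a1", "a2"})]
freq = frequent_itemsets(db, min_sup=1)
closed, maximal = closed_and_max(freq)
```

Here `freq` holds all 2^4 - 1 = 15 patterns, `closed` holds {a1, a2}: 2 and {a1, …, a4}: 1, and `maximal` holds only {a1, …, a4}: 1, mirroring the answers above.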
Chapter 5: Mining Frequent Patterns, Association and
Correlations: Basic Concepts and Methods
Basic Concepts
Evaluation Methods
Summary
Scalable Frequent Itemset Mining Methods
Approach
Data Format
The Apriori Algorithm—An Example
Supmin = 2

Database TDB
Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

1st scan: C1
Itemset   sup
{A}       2
{B}       3
{C}       3
{D}       1
{E}       3

L1
Itemset   sup
{A}       2
{B}       3
{C}       3
{E}       3

C2
Itemset
{A, B}
{A, C}
{A, E}
{B, C}
{B, E}
{C, E}

2nd scan: C2
Itemset   sup
{A, B}    1
{A, C}    2
{A, E}    1
{B, C}    2
{B, E}    3
{C, E}    2

L2
Itemset   sup
{A, C}    2
{B, C}    2
{B, E}    3
{C, E}    2

C3
Itemset
{B, C, E}

3rd scan: L3
Itemset     sup
{B, C, E}   2
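The level-wise scans in this example can be reproduced with a short Python sketch. It is a minimal illustration, not the textbook pseudo-code: the function name `apriori` is illustrative, and the join step simply unions pairs of frequent k-itemsets and then prunes, rather than using a prefix-based self-join.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return {frozenset: support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted(set().union(*transactions))
    candidates = [frozenset([i]) for i in items]   # C1
    frequent = {}
    k = 1
    while candidates:
        # One database scan: count every candidate of size k.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_sup}   # Lk
        frequent.update(level)
        # Join: unions of two frequent k-itemsets that have k + 1 items.
        joined = {a | b for a in level for b in level if len(a | b) == k + 1}
        # Prune: keep a candidate only if all its k-subsets are frequent.
        candidates = [c for c in joined
                      if all(frozenset(s) in level for s in combinations(c, k))]
        k += 1
    return frequent

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
result = apriori(tdb, min_sup=2)
```

On the TDB above, `result` contains the four frequent 1-itemsets, the four frequent 2-itemsets {A, C}, {B, C}, {B, E}, {C, E}, and the single 3-itemset {B, C, E} with support 2.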
The Apriori Algorithm (Pseudo-Code)
Implementation of Apriori
How to generate candidates?
Step 1: self-joining Lk
Step 2: pruning
Example of Candidate-generation
L3={abc, abd, acd, ace, bcd}
Self-joining: L3*L3
abcd from abc and abd
acde from acd and ace
Pruning:
acde is removed because ade is not in L3
C4 = {abcd}
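The two steps of candidate generation can be sketched in Python as follows. This assumes each itemset is kept as a sorted tuple of items, so that two k-itemsets are joined when they share their first k - 1 items; the function name `generate_candidates` is illustrative.

```python
from itertools import combinations

def generate_candidates(L_k):
    """Self-join L_k with itself, then prune. Returns (joined, pruned)."""
    L_k = sorted(tuple(sorted(s)) for s in L_k)
    frequent = set(L_k)
    k = len(L_k[0]) if L_k else 0
    # Step 1: self-join. Merge two itemsets that agree on their first
    # k - 1 items; the result is a sorted (k + 1)-itemset.
    joined = []
    for i, p in enumerate(L_k):
        for q in L_k[i + 1:]:
            if p[:-1] == q[:-1]:
                joined.append(p + (q[-1],))
    # Step 2: prune. Drop any candidate with an infrequent k-subset.
    pruned = [c for c in joined
              if all(s in frequent for s in combinations(c, k))]
    return joined, pruned

joined, C4 = generate_candidates(["abc", "abd", "acd", "ace", "bcd"])
```

For the L3 above, `joined` contains abcd and acde, and `C4` keeps only abcd, because the subset ade of acde is not in L3.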
Scalable Frequent Itemset Mining Methods
Further Improvement of the Apriori Method
Improvements to Apriori
patterns
Scan 2: consolidate global frequent patterns
DIC (dynamic itemset counting)
Once both A and D are determined frequent, the counting of AD begins
Once all length-2 subsets of BCD are determined frequent, the counting of BCD begins
[Figure: itemset lattice over items A, B, C, D, with levels {}; A, B, C, D; AB, AC, BC, AD, BD, CD; ABC, ABD, ACD, BCD; ABCD]
[Figure: timeline over the transactions comparing when Apriori and DIC start counting 1-itemsets, 2-itemsets, 3-itemsets, …]
S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In SIGMOD’97
• Alternative to Apriori Itemset Generation
• Itemsets are dynamically added and deleted as
transactions are read
• Relies on the fact that for an itemset to be frequent, all
of its subsets must also be frequent, so we only
examine those itemsets whose subsets are all frequent
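The idea of starting new counters while transactions are still being read can be sketched in Python. This is a simplified illustration under stated assumptions, not the published algorithm: the names `dic` and `block_size` are hypothetical, support is an absolute count, a counter is retired once it has seen every transaction exactly once (wrapping around the database), and new counters are only started at block boundaries.

```python
from itertools import combinations

def dic(transactions, min_sup, block_size):
    """Simplified dynamic itemset counting; returns {itemset: count}."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    items = sorted(set().union(*transactions))
    count = {frozenset([i]): 0 for i in items}   # every counter ever started
    seen = dict.fromkeys(count, 0)               # transactions read per counter
    active = set(count)                          # counters still being counted
    pos = 0
    while active:
        t = transactions[pos % n]                # read cyclically
        pos += 1
        for s in active:
            seen[s] += 1
            if s <= t:
                count[s] += 1
        # Retire counters that have now seen the whole database once.
        active = {s for s in active if seen[s] < n}
        if pos % block_size == 0 or not active:
            # Counts only grow, so reaching min_sup means "frequent".
            likely = {s for s, c in count.items() if c >= min_sup}
            for s in likely:
                for i in items:
                    c = s | {i}
                    if (len(c) == len(s) + 1 and c not in count
                            and all(frozenset(sub) in likely
                                    for sub in combinations(c, len(s)))):
                        # All immediate subsets look frequent: start counting c.
                        count[c], seen[c] = 0, 0
                        active.add(c)
    return {s: c for s, c in count.items() if c >= min_sup}

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
result = dic(tdb, min_sup=2, block_size=1)
```

Because every counter covers each transaction exactly once, the final counts are exact regardless of `block_size`; a smaller block size only lets larger itemsets start being counted earlier.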