06 Apriori

Data Mining:
Concepts and Techniques

(3rd ed.)
— Chapter 6 —
Jiawei Han, Micheline Kamber, and Jian Pei

University of Illinois at Urbana-Champaign &
Simon Fraser University
©2013 Han, Kamber & Pei. All rights reserved.
Chapter 6: Mining Frequent Patterns, Association and
Correlations: Basic Concepts and Methods
 Basic Concepts
 Frequent Itemset Mining Methods
 Which Patterns Are Interesting?—Pattern
Evaluation Methods
 Summary
What Is Frequent Pattern Analysis?
 Frequent pattern: a pattern (a set of items, subsequences, substructures,
etc.) that occurs frequently in a data set
 First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context
of frequent itemsets and association rule mining
 Motivation: Finding inherent regularities in data
 What products were often purchased together?— Beer and diapers?!
 What are the subsequent purchases after buying a PC?
 What kinds of DNA are sensitive to this new drug?
 Can we automatically classify web documents?
 Applications
 Basket data analysis, cross-marketing, catalog design, sale campaign
analysis, Web log (click stream) analysis, and DNA sequence analysis.
Why Is Freq. Pattern Mining Important?
 Freq. pattern: An intrinsic and important property of
datasets
 Foundation for many essential data mining tasks
 Association, correlation, and causality analysis
 Sequential, structural (e.g., sub-graph) patterns
 Pattern analysis in spatiotemporal, multimedia, time-
series, and stream data

 Classification: discriminative, frequent pattern analysis
 Cluster analysis: frequent pattern-based clustering
Basic Concepts: Frequent Patterns
Tid  Items bought
10   Beer, Nuts, Diaper
20   Beer, Coffee, Diaper
30   Beer, Diaper, Eggs
40   Nuts, Eggs, Milk
50   Nuts, Coffee, Diaper, Eggs, Milk
 itemset: A set of one or more items
 k-itemset X = {x1, …, xk}
 (absolute) support, or support count, of X: frequency or number of occurrences of an itemset X
 (relative) support, s, is the fraction of transactions that contain X (i.e., the probability that a transaction contains X)
 An itemset X is frequent if X’s support is no less than a minsup threshold
[Figure: Venn diagram of customers who buy beer, customers who buy diapers, and customers who buy both]
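The definitions above can be sketched in a few lines of Python. This is only an illustration on the five-transaction table from this slide; the helper names `support_count`, `relative_support`, and `is_frequent` are assumptions introduced here, not names from the textbook.

```python
# The five transactions from the table above, one set per Tid.
transactions = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]

def support_count(itemset, transactions):
    """Absolute support: number of transactions that contain every item of itemset."""
    return sum(1 for t in transactions if set(itemset) <= t)

def relative_support(itemset, transactions):
    """Relative support: fraction of transactions that contain itemset."""
    return support_count(itemset, transactions) / len(transactions)

def is_frequent(itemset, transactions, minsup):
    """An itemset is frequent if its relative support is no less than minsup."""
    return relative_support(itemset, transactions) >= minsup

print(support_count({"Beer", "Diaper"}, transactions))
print(relative_support({"Beer", "Diaper"}, transactions))
print(is_frequent({"Coffee"}, transactions, minsup=0.5))
```

With minsup = 50%, an itemset needs to appear in at least 3 of the 5 transactions, so {Beer, Diaper} qualifies and {Coffee} does not.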
Basic Concepts: Association Rules
Tid  Items bought
10   Beer, Nuts, Diaper
20   Beer, Coffee, Diaper
30   Beer, Diaper, Eggs
40   Nuts, Eggs, Milk
50   Nuts, Coffee, Diaper, Eggs, Milk
 Find all the rules X → Y with minimum support and confidence
 support, s, probability that a transaction contains X ∪ Y
 confidence, c, conditional probability that a transaction having X also contains Y
Let minsup = 50%, minconf = 50%
Freq. Pat.: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3
 Association rules: (many more!)
 Beer → Diaper (60%, 100%)
 Diaper → Beer (60%, 75%)
[Figure: Venn diagram of customers who buy beer, customers who buy diapers, and customers who buy both]
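The two rule figures above follow directly from the definitions: support is the fraction of transactions containing X ∪ Y, and confidence is that count divided by the count of transactions containing X. A minimal sketch, assuming a hypothetical helper `rule_metrics` and the same five transactions:

```python
transactions = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]

def rule_metrics(x, y, transactions):
    """Return (support, confidence) of the rule x -> y.

    support    = P(transaction contains x and y)
    confidence = P(transaction contains y | it contains x)
    """
    x, y = set(x), set(y)
    n_both = sum(1 for t in transactions if (x | y) <= t)
    n_x = sum(1 for t in transactions if x <= t)
    support = n_both / len(transactions)
    # Assumption: a rule whose antecedent never occurs gets confidence 0.
    confidence = n_both / n_x if n_x else 0.0
    return support, confidence

print(rule_metrics({"Beer"}, {"Diaper"}, transactions))
print(rule_metrics({"Diaper"}, {"Beer"}, transactions))
```

Both rules share the same support because they are built from the same itemset {Beer, Diaper}; only the antecedent count in the denominator differs.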
Closed Patterns and Max-Patterns
 A long pattern contains a combinatorial number of sub-patterns, e.g., {a1, …, a100} contains $\binom{100}{1} + \binom{100}{2} + \dots + \binom{100}{100} = 2^{100} - 1 \approx 1.27 \times 10^{30}$ sub-patterns!
 Solution: Mine closed patterns and max-patterns instead
 An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X (proposed by Pasquier, et al. @ ICDT’99)
 An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X (proposed by Bayardo @ SIGMOD’98)
 Closed pattern is a lossless compression of freq. patterns
 Reducing the # of patterns and rules
March 15, 2024 Data Mining: Concepts and Techniques 8
Closed Patterns and Max-Patterns
 Exercise: Suppose a DB contains only two transactions
 <a1, …, a100>, <a1, …, a50>
 Let min_sup = 1
 What is the set of closed itemsets?
 {a1, …, a100}: 1
 {a1, …, a50}: 2
 What is the set of max-patterns?
 {a1, …, a100}: 1
 What is the set of all patterns?
 {a1}: 2, …, {a1, a2}: 2, …, {a1, a51}: 1, …, {a1, a2, …, a100}: 1
 A big number: $2^{100} - 1$? Why?
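The exercise can be checked by brute force on a scaled-down instance. The sketch below assumes a database of two transactions, <a1, …, a4> and <a1, a2>, in place of the 100-item and 50-item ones, because enumerating all subsets is exponential. The function names are hypothetical and the enumeration is for illustration only, not a practical mining method.

```python
from itertools import combinations

def all_frequent(transactions, min_sup):
    """Brute force: support count of every non-empty itemset with count >= min_sup."""
    items = sorted(set().union(*transactions))
    freq = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            s = frozenset(combo)
            n = sum(1 for t in transactions if s <= t)
            if n >= min_sup:
                freq[s] = n
    return freq

def closed_patterns(freq):
    """Frequent itemsets with no proper superset of the same support."""
    return {x: n for x, n in freq.items()
            if not any(x < y and freq[y] == n for y in freq)}

def max_patterns(freq):
    """Frequent itemsets with no frequent proper superset."""
    return {x: n for x, n in freq.items() if not any(x < y for y in freq)}

db = [frozenset({"a1", "a2", "a3", "a4"}), frozenset({"a1", "a2"})]
freq = all_frequent(db, min_sup=1)
print(len(freq))                 # number of frequent patterns
print(closed_patterns(freq))     # closed itemsets with their supports
print(max_patterns(freq))        # max-patterns with their supports
```

Every non-empty subset of the longer transaction is frequent at min_sup = 1, which is why the scaled-down count is 2^4 - 1 = 15 and the count in the exercise is 2^100 - 1. The two closed itemsets still carry enough information to recover every support, while the single max-pattern does not.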
Chapter 6: Mining Frequent Patterns, Association and
Correlations: Basic Concepts and Methods
 Basic Concepts
 Frequent Itemset Mining Methods
 Which Patterns Are Interesting?—Pattern
Evaluation Methods
 Summary
Scalable Frequent Itemset Mining Methods
 Apriori: A Candidate Generation-and-Test
Approach
 Improving the Efficiency of Apriori
 FPGrowth: A Frequent Pattern-Growth Approach
 ECLAT: Frequent Pattern Mining with Vertical
Data Format
The Apriori Algorithm—An Example
Supmin = 2

Database TDB
Tid  Items
10   A, C, D
20   B, C, E
30   A, B, C, E
40   B, E

C1 (after 1st scan)
Itemset  sup
{A}      2
{B}      3
{C}      3
{D}      1
{E}      3

L1
Itemset  sup
{A}      2
{B}      3
{C}      3
{E}      3

C2 (generated from L1)
Itemset
{A, B}
{A, C}
{A, E}
{B, C}
{B, E}
{C, E}

C2 (after 2nd scan)
Itemset  sup
{A, B}   1
{A, C}   2
{A, E}   1
{B, C}   2
{B, E}   3
{C, E}   2

L2
Itemset  sup
{A, C}   2
{B, C}   2
{B, E}   3
{C, E}   2

C3
Itemset
{B, C, E}

L3 (after 3rd scan)
Itemset    sup
{B, C, E}  2
The Apriori Algorithm (Pseudo-Code)
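A minimal Python sketch of the level-wise generate-and-test loop, offered as an illustration and not as the textbook pseudo-code itself. The function name `apriori` and the dictionary return type are assumptions. The sample data is the four-transaction TDB from the preceding example, with a minimum support count of 2. Candidates are formed here by taking unions of pairs of frequent k-itemsets, which is simpler than the prefix-based self-join but yields the same candidate set after pruning.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return {itemset: support count} for every itemset with count >= min_sup."""
    transactions = [frozenset(t) for t in transactions]

    # First scan: count the 1-itemsets and keep the frequent ones (L1).
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    level = {s: n for s, n in counts.items() if n >= min_sup}
    frequent = dict(level)

    k = 1
    while level:
        # Generate C(k+1): unions of two frequent k-itemsets that have k+1
        # items and whose k-subsets are all frequent (Apriori pruning).
        candidates = set()
        for a, b in combinations(level, 2):
            c = a | b
            if len(c) == k + 1 and all(
                frozenset(s) in level for s in combinations(c, k)
            ):
                candidates.add(c)
        # One scan of the database counts every candidate of this size.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {s: n for s, n in counts.items() if n >= min_sup}  # L(k+1)
        frequent.update(level)
        k += 1
    return frequent

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
result = apriori(tdb, min_sup=2)
for itemset in sorted(result, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset), result[itemset])
```

The loop stops when a level produces no frequent itemsets, so on this data it performs three counting scans, matching the example.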
Implementation of Apriori
 How to generate candidates?
 Step 1: self-joining Lk
 Step 2: pruning
 Example of Candidate-generation
 L3={abc, abd, acd, ace, bcd}
 Self-joining: L3*L3
 abcd from abc and abd
 acde from acd and ace
 Pruning:
 acde is removed because ade is not in L3
 C4 = {abcd}
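The two steps above can be sketched as follows. This is a minimal illustration, assuming itemsets are stored as sorted tuples so that the join can compare prefixes; the name `apriori_gen` is an assumption introduced here.

```python
from itertools import combinations

def apriori_gen(Lk):
    """Generate C(k+1) from the frequent k-itemsets Lk (sorted tuples)."""
    Lk = set(Lk)
    if not Lk:
        return set()
    k = len(next(iter(Lk)))
    candidates = set()
    for p in Lk:
        for q in Lk:
            # Step 1, self-join: p and q agree on their first k-1 items
            # and p's last item sorts before q's last item.
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                c = p + (q[-1],)
                # Step 2, pruning: drop c if any k-subset is not in Lk.
                if all(s in Lk for s in combinations(c, k)):
                    candidates.add(c)
    return candidates

L3 = {tuple("abc"), tuple("abd"), tuple("acd"), tuple("ace"), tuple("bcd")}
C4 = apriori_gen(L3)
print(C4)
```

The join produces abcd and acde; acde is then dropped because its subset ade is not in L3, leaving only abcd.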
Scalable Frequent Itemset Mining Methods
 Apriori: A Candidate Generation-and-Test Approach
 Improving the Efficiency of Apriori
 FPGrowth: A Frequent Pattern-Growth Approach
 ECLAT: Frequent Pattern Mining with Vertical Data Format
 Mining Closed Frequent Patterns and Max-Patterns
Further Improvement of the Apriori Method
 Major computational challenges

 Multiple scans of transaction database
 Huge number of candidates
 Tedious workload of support counting for candidates
 Improving Apriori: general ideas
 Reduce passes of transaction database scans
 Shrink number of candidates
 Facilitate support counting of candidates
Improvements to Apriori

Partition: Scan Database Only Twice
 Any itemset that is potentially frequent in DB must be
frequent in at least one of the partitions of DB
 Scan 1: partition database and find local frequent
patterns
 Scan 2: consolidate global frequent patterns
 A. Savasere, E. Omiecinski and S. Navathe, VLDB’95
DB1 + DB2 + … + DBk = DB

If $sup_1(i) < \sigma|DB_1|$, $sup_2(i) < \sigma|DB_2|$, …, $sup_k(i) < \sigma|DB_k|$, then $sup(i) < \sigma|DB|$
Partitioning
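A rough sketch of the two-scan idea, not the algorithm as published by Savasere et al. It assumes equal-sized contiguous partitions, a relative threshold `sigma`, and a brute-force local miner; the names `local_frequent` and `partition_mine` are hypothetical. A real implementation would run an efficient in-memory miner on each partition.

```python
from itertools import combinations
from math import ceil

def local_frequent(part, sigma):
    """Brute force: itemsets with support >= sigma * len(part) inside one partition."""
    threshold = sigma * len(part)
    items = sorted(set().union(*part))
    found = set()
    for k in range(1, len(items) + 1):
        level = {
            frozenset(c) for c in combinations(items, k)
            if sum(1 for t in part if frozenset(c) <= t) >= threshold
        }
        if not level:  # no frequent k-itemsets, so no larger ones either
            break
        found |= level
    return found

def partition_mine(db, sigma, n_parts):
    db = [frozenset(t) for t in db]
    size = max(1, ceil(len(db) / n_parts))
    parts = [db[i:i + size] for i in range(0, len(db), size)]

    # Scan 1: the union of locally frequent itemsets is the candidate set.
    candidates = set()
    for part in parts:
        candidates |= local_frequent(part, sigma)

    # Scan 2: count each candidate over the whole database.
    threshold = sigma * len(db)
    result = {}
    for c in candidates:
        n = sum(1 for t in db if c <= t)
        if n >= threshold:
            result[c] = n
    return result

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
result = partition_mine(tdb, sigma=0.5, n_parts=2)
print(len(result), result[frozenset("BCE")])
```

No globally frequent itemset can be missed in scan 1: an itemset that is infrequent in every partition is infrequent in the whole database. The candidate set may contain false positives, which scan 2 removes.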

Pincer Search

Pincer Search Contd.

 Refer to the class notes for an example

DIC: Reduce Number of Scans
 Once both A and D are determined frequent, the counting of AD begins
 Once all length-2 subsets of BCD are determined frequent, the counting of BCD begins
[Figure: itemset lattice over items A, B, C, D, with levels {}; A, B, C, D; AB, AC, BC, AD, BD, CD; ABC, ABD, ACD, BCD; ABCD. Beside it, a timeline over the transactions compares when Apriori and DIC begin counting 1-itemsets, 2-itemsets, and 3-itemsets.]
S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In SIGMOD’97
• Alternative to Apriori Itemset Generation
• Itemsets are dynamically added and deleted as
transactions are read
• Relies on the fact that for an itemset to be frequent, all
of its subsets must also be frequent, so we only
examine those itemsets whose subsets are all frequent

DIC
• Solid box: confirmed frequent itemset - an itemset we have finished counting and that exceeds the support threshold minsupp
• Solid circle: confirmed infrequent itemset - an itemset we have finished counting and that is below minsupp
• Dashed box: suspected frequent itemset - an itemset we are still counting that exceeds minsupp
• Dashed circle: suspected infrequent itemset - an itemset we are still counting that is below minsupp

Sampling for Frequent Patterns