
Mining Association Rules in Large Databases
■ Association rule mining
■ Algorithms for scalable mining of (single-dimensional
Boolean) association rules in transactional databases
■ Mining various kinds of association/correlation rules
■ Constraint-based association mining
■ Sequential pattern mining
■ Applications/extensions of frequent pattern mining
■ Summary



What Is Association Mining?
■ Association rule mining:
■ Finding frequent patterns, associations, correlations, or
causal structures among sets of items or objects in
transaction databases, relational databases, and other
information repositories.
■ Frequent pattern: pattern (set of items, sequence,
etc.) that occurs frequently in a database
■ Motivation: finding regularities in data
■ What products were often purchased together? — Beer
and diapers?!
■ What are the subsequent purchases after buying a PC?

■ What kinds of DNA are sensitive to this new drug?

■ Can we automatically classify web documents?



Why Is Frequent Pattern or Association Mining an Essential Task in Data Mining?
■ Foundation for many essential data mining tasks
■ Association, correlation, causality
■ Sequential patterns, temporal or cyclic association,
partial periodicity, spatial and multimedia association
■ Associative classification, cluster analysis, iceberg cube,
fascicles (semantic data compression)
■ Broad applications
■ Basket data analysis, cross-marketing, catalog design,
sale campaign analysis
■ Web log (click stream) analysis, DNA sequence analysis,
etc.
Basic Concepts: Frequent Patterns and Association Rules

■ Itemset X = {x1, …, xk}

  Transaction-id  Items bought
  10              A, B, C
  20              A, C
  30              A, D
  40              B, E, F

■ Find all the rules X ⇒ Y with min confidence and support
  ■ support, s: probability that a transaction contains X ∪ Y
  ■ confidence, c: conditional probability that a transaction having X also contains Y

■ Let min_support = 50%, min_conf = 50%:
  ■ A ⇒ C (50%, 66.7%)
  ■ C ⇒ A (50%, 100%)

(Figure: Venn diagram of customers who buy beer, who buy diaper, and who buy both)
Mining Association Rules—an Example

Min. support 50%, min. confidence 50%

  Transaction-id  Items bought
  10              A, B, C
  20              A, C
  30              A, D
  40              B, E, F

  Frequent pattern  Support
  {A}               75%
  {B}               50%
  {C}               50%
  {A, C}            50%

For rule A ⇒ C:
  support = support({A} ∪ {C}) = 50%
  confidence = support({A} ∪ {C}) / support({A}) = 66.7%
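These numbers can be checked directly; a minimal Python sketch (supports computed as fractions of the four transactions above):

  tdb = [{"A","B","C"}, {"A","C"}, {"A","D"}, {"B","E","F"}]

  def sup(itemset):
      # fraction of transactions containing the itemset
      return sum(1 for t in tdb if itemset <= t) / len(tdb)

  print(sup({"A"}))                    # 0.75
  print(sup({"A", "C"}))               # 0.5    → support of A ⇒ C
  print(sup({"A", "C"}) / sup({"A"}))  # 0.666… → confidence of A ⇒ C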


The Apriori Algorithm—An Example  (min. support count = 2)

Database TDB:
  Tid  Items
  10   A, C, D
  20   B, C, E
  30   A, B, C, E
  40   B, E

1st scan → C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
L1: {A}:2, {B}:3, {C}:3, {E}:3

C2 (self-join of L1): {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
2nd scan → C2 counts: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2

C3 (self-join of L2): {B, C, E}
3rd scan → L3: {B, C, E}:2
The Apriori Algorithm
■ Pseudo-code:
    Ck: candidate itemsets of size k
    Lk: frequent itemsets of size k
    L1 = {frequent items};
    for (k = 1; Lk != ∅; k++) do begin
        Ck+1 = candidates generated from Lk;
        for each transaction t in database do
            increment the count of all candidates in Ck+1 that are contained in t;
        Lk+1 = candidates in Ck+1 with min_support;
    end
    return ∪k Lk;
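For concreteness, a minimal self-contained Python sketch of this loop (an illustration under simple assumptions, not the book's reference implementation: itemsets are frozensets, min_sup is an absolute count):

  from itertools import combinations
  from collections import defaultdict

  def apriori(transactions, min_sup):
      transactions = [set(t) for t in transactions]
      # First scan: frequent 1-itemsets
      counts = defaultdict(int)
      for t in transactions:
          for item in t:
              counts[frozenset([item])] += 1
      Lk = {s: c for s, c in counts.items() if c >= min_sup}
      frequent, k = dict(Lk), 1
      while Lk:
          # Generate C(k+1) by self-joining Lk, pruning by the Apriori property
          Ck1 = set()
          for p in Lk:
              for q in Lk:
                  u = p | q
                  if len(u) == k + 1 and all(frozenset(s) in Lk
                                             for s in combinations(u, k)):
                      Ck1.add(u)
          # One scan: count every candidate contained in each transaction
          counts = defaultdict(int)
          for t in transactions:
              for c in Ck1:
                  if c <= t:
                      counts[c] += 1
          Lk = {c: n for c, n in counts.items() if n >= min_sup}
          frequent.update(Lk)
          k += 1
      return frequent

  tdb = [["A","C","D"], ["B","C","E"], ["A","B","C","E"], ["B","E"]]
  print(apriori(tdb, 2))   # reproduces L1–L3 of the example slide above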



Important Details of Apriori
■ How to generate candidates?
■ Step 1: self-joining Lk
■ Step 2: pruning
■ How to count supports of candidates?
■ Example of Candidate-generation
■ L3={abc, abd, acd, ace, bcd}
■ Self-joining: L3*L3
■ abcd from abc and abd
■ acde from acd and ace
■ Pruning:
■ acde is removed because ade is not in L3
■ C4={abcd}



How to Generate Candidates?

■ Suppose the items in Lk-1 are listed in an order
■ Step 1: self-joining Lk-1
    insert into Ck
    select p.item1, p.item2, …, p.itemk-1, q.itemk-1
    from Lk-1 p, Lk-1 q
    where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
■ Step 2: pruning
    forall itemsets c in Ck do
        forall (k-1)-subsets s of c do
            if (s is not in Lk-1) then delete c from Ck
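The same two steps in Python, as a sketch that assumes itemsets are kept as sorted tuples so the join condition mirrors the SQL-style pseudocode above:

  from itertools import combinations

  def gen_candidates(Lk_1, k):
      """Generate size-k candidates from the frequent (k-1)-itemsets."""
      Lk_1 = sorted(Lk_1)                        # lexicographic order
      prev = set(Lk_1)
      Ck = []
      for i, p in enumerate(Lk_1):
          for q in Lk_1[i + 1:]:
              # Step 1: self-join — equal on the first k-2 items,
              # p's last item smaller than q's last item
              if p[:-1] != q[:-1]:
                  break                          # sorted list: no more matches for p
              c = p + (q[-1],)
              # Step 2: prune — every (k-1)-subset must be frequent
              if all(s in prev for s in combinations(c, k - 1)):
                  Ck.append(c)
      return Ck

  L3 = [("a","b","c"), ("a","b","d"), ("a","c","d"), ("a","c","e"), ("b","c","d")]
  print(gen_candidates(L3, 4))   # [('a','b','c','d')] — acde pruned since ade ∉ L3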



How to Count Supports of Candidates?

■ Why is counting supports of candidates a problem?
■ The total number of candidates can be very huge
■ One transaction may contain many candidates
■ Method:
■ Candidate itemsets are stored in a hash-tree
■ Leaf node of hash-tree contains a list of itemsets and
counts
■ Interior node contains a hash table
■ Subset function: finds all the candidates contained in
a transaction
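A simplified sketch of the subset function, using a flat Python dict in place of the hash-tree described above (the hash-tree exists to avoid enumerating every k-subset of long transactions; this illustration enumerates them anyway):

  from itertools import combinations

  def count_supports(transactions, Ck, k):
      counts = {c: 0 for c in Ck}                # candidates as sorted tuples
      for t in transactions:
          for s in combinations(sorted(t), k):   # all k-subsets of t
              if s in counts:
                  counts[s] += 1
      return counts

  tdb = [["A","C","D"], ["B","C","E"], ["A","B","C","E"], ["B","E"]]
  C2 = [("A","C"), ("B","E"), ("C","E")]
  print(count_supports(tdb, C2, 2))   # {('A','C'): 2, ('B','E'): 3, ('C','E'): 2}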



Apriori: A Candidate Generation-and-test Approach

■ Any subset of a frequent itemset must be frequent


■ if {beer, diaper, nuts} is frequent, so is {beer,
diaper}
■ Every transaction having {beer, diaper, nuts} also
contains {beer, diaper}
■ Apriori pruning principle: If there is any itemset which is
infrequent, its superset should not be generated/tested!
■ Method:
■ generate length (k+1) candidate itemsets from length k
frequent itemsets, and
■ test the candidates against DB

■ Performance studies show its efficiency and scalability
■ Agrawal & Srikant 1994; Mannila et al. 1994
Challenges of Frequent Pattern Mining

■ Challenges
■ Multiple scans of transaction database
■ Huge number of candidates
■ Tedious workload of support counting for
candidates
■ Improving Apriori: general ideas
■ Reduce passes of transaction database scans
■ Shrink number of candidates
■ Facilitate support counting of candidates
DIC: Reduce Number of Scans

■ Once both A and D are determined frequent, the counting of AD begins
■ Once all length-2 subsets of BCD are determined frequent, the counting of BCD begins

(Figure: the itemset lattice over {A, B, C, D} — {}; A, B, C, D; AB, AC, AD, BC, BD, CD; ABC, ABD, ACD, BCD; ABCD — beside a transaction timeline contrasting Apriori, which finishes counting 1-itemsets before starting 2-itemsets, with DIC, which starts counting 2- and 3-itemsets partway through earlier scans)

■ S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In SIGMOD'97
Partition: Scan Database Only Twice

■ Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
■ Scan 1: partition database and find local frequent patterns
■ Scan 2: consolidate global frequent patterns
■ A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. In VLDB'95
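A sketch of the two-scan skeleton in Python. Illustrative only: the local miner here is a brute-force enumerator, fine for tiny partitions, whereas the paper runs a levelwise algorithm per partition; min_frac is a support fraction:

  from itertools import combinations

  def local_frequent(part, min_frac):
      """All itemsets frequent within one partition (brute force)."""
      need = min_frac * len(part)
      items = sorted({i for t in part for i in t})
      out, k = set(), 1
      while k <= len(items):
          found = False
          for c in combinations(items, k):
              if sum(1 for t in part if set(c) <= set(t)) >= need:
                  out.add(c)
                  found = True
          if not found:
              break        # no frequent k-itemset → none of size k+1 either
          k += 1
      return out

  def partition_mine(db, min_frac, n_parts=2):
      parts = [db[i::n_parts] for i in range(n_parts)]
      # Scan 1: every globally frequent itemset is locally frequent somewhere
      candidates = set().union(*(local_frequent(p, min_frac) for p in parts))
      # Scan 2: one pass over the full DB keeps the truly frequent ones
      need = min_frac * len(db)
      return {c for c in candidates
              if sum(1 for t in db if set(c) <= set(t)) >= need}

  tdb = [["A","C","D"], ["B","C","E"], ["A","B","C","E"], ["B","E"]]
  print(sorted(partition_mine(tdb, 0.5), key=lambda c: (len(c), c)))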


Sampling for Frequent Patterns

■ Select a sample of the original database, mine frequent patterns within the sample using Apriori
■ Scan database once to verify frequent itemsets found in
sample, only borders of closure of frequent patterns are
checked
■ Example: check abcd instead of ab, ac, …, etc.

■ Scan database again to find missed frequent patterns


■ H. Toivonen. Sampling large databases for association
rules. In VLDB’96



DHP: Reduce the Number of Candidates

■ A k-itemset whose corresponding hashing bucket count is


below the threshold cannot be frequent
■ Candidates: a, b, c, d, e

■ Hash entries: {ab, ad, ae} {bd, be, de} …

■ Frequent 1-itemset: a, b, d, e

■ ab is not a candidate 2-itemset if the sum of count of

{ab, ad, ae} is below support threshold


■ J. Park, M. Chen, and P. Yu. An effective hash-based
algorithm for mining association rules. In SIGMOD’95
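A small sketch of the hashing idea above, assuming an absolute min_sup count. Python's built-in hash stands in for DHP's hash function; bucket counts over-estimate true pair counts, so the filter never discards a frequent pair (exact output varies with Python's hash seed, but it is always a subset of Apriori's C2 containing every frequent pair):

  from itertools import combinations

  def dhp_candidate_pairs(transactions, min_sup, n_buckets=7):
      item_cnt, buckets = {}, [0] * n_buckets
      for t in transactions:                        # single scan
          for i in set(t):
              item_cnt[i] = item_cnt.get(i, 0) + 1
          for pair in combinations(sorted(set(t)), 2):
              buckets[hash(pair) % n_buckets] += 1  # bucket ≥ true pair count
      L1 = sorted(i for i, c in item_cnt.items() if c >= min_sup)
      # A pair survives only if both items are frequent AND its bucket passes
      return [p for p in combinations(L1, 2)
              if buckets[hash(p) % n_buckets] >= min_sup]

  tdb = [["A","C","D"], ["B","C","E"], ["A","B","C","E"], ["B","E"]]
  print(dhp_candidate_pairs(tdb, 2))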



Eclat/MaxEclat and VIPER: Exploring Vertical
Data Format

■ Use tid-list, the list of transaction-ids containing an itemset


■ Compression of tid-lists
■ Itemset A: t1, t2, t3, sup(A)=3
■ Itemset B: t2, t3, t4, sup(B)=3
■ Itemset AB: t2, t3, sup(AB)=2
■ Major operation: intersection of tid-lists
■ M. Zaki et al. New algorithms for fast discovery of
association rules. In KDD’97
■ P. Shenoy et al. Turbo-charging vertical mining of large
databases. In SIGMOD’00
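The tid-list example above in runnable form (a minimal sketch with tid-lists as Python sets; support is the length of a tid-list, and extending an itemset is an intersection):

  def vertical(db):
      tidlists = {}
      for tid, items in db:
          for i in items:
              tidlists.setdefault(i, set()).add(tid)
      return tidlists

  db = [("t1", ["A"]), ("t2", ["A", "B"]), ("t3", ["A", "B"]), ("t4", ["B"])]
  t = vertical(db)
  print(sorted(t["A"]), len(t["A"]))   # ['t1','t2','t3']  sup(A) = 3
  print(sorted(t["B"]), len(t["B"]))   # ['t2','t3','t4']  sup(B) = 3
  ab = t["A"] & t["B"]                 # intersection = tid-list of AB
  print(sorted(ab), len(ab))           # ['t2','t3']       sup(AB) = 2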



Bottleneck of Frequent-pattern Mining

■ Multiple database scans are costly


■ Mining long patterns needs many passes of
scanning and generates lots of candidates
■ To find frequent itemset i1 i2 … i100
  ■ # of scans: 100
  ■ # of candidates: C(100,1) + C(100,2) + … + C(100,100) = 2^100 − 1 ≈ 1.27 × 10^30 !
■ Bottleneck: candidate-generation-and-test
■ Can we avoid candidate generation?



Mining Frequent Patterns Without
Candidate Generation

■ Grow long patterns from short ones using local frequent items
  ■ "abc" is a frequent pattern
  ■ Get all transactions having "abc": DB|abc
  ■ "d" is a local frequent item in DB|abc ⇒ "abcd" is a frequent pattern
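A sketch of this projection idea in Python (illustrative, without the FP-tree compression the next slides add; items are grown in a fixed lexicographic order so each pattern is generated exactly once, and min_sup is an absolute count):

  def grow(db, suffix, min_sup, out):
      counts = {}
      for t in db:
          for i in set(t):
              counts[i] = counts.get(i, 0) + 1
      for item, c in sorted(counts.items()):
          if c < min_sup:
              continue
          pattern = suffix | {item}
          out[frozenset(pattern)] = c              # support of the grown pattern
          # DB|item: transactions containing item, keeping only items that
          # sort after it, so each pattern is reached along exactly one path
          proj = [[j for j in t if j > item] for t in db if item in t]
          grow(proj, pattern, min_sup, out)
      return out

  tdb = [["A","C","D"], ["B","C","E"], ["A","B","C","E"], ["B","E"]]
  print(grow(tdb, set(), 2, {}))   # includes frozenset({'B','C','E'}): 2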



Mining Frequent Patterns With FP-trees
■ Idea: Frequent pattern growth
■ Recursively grow frequent patterns by pattern and
database partition
■ Method
■ For each frequent item, construct its conditional
pattern-base, and then its conditional FP-tree
■ Repeat the process on each newly created conditional FP-tree
■ Until the resulting FP-tree is empty, or it contains only one path—a single path generates all the combinations of its sub-paths, each of which is a frequent pattern



Scaling FP-growth by DB Projection

■ FP-tree cannot fit in memory?—DB projection


■ First partition a database into a set of projected DBs
■ Then construct and mine FP-tree for each projected
DB
■ Parallel projection vs. Partition projection techniques
■ Parallel projection is space costly



Construct FP-tree from a Transaction Database

min_support = 3

  TID  Items bought               (ordered) frequent items
  100  {f, a, c, d, g, i, m, p}   {f, c, a, m, p}
  200  {a, b, c, f, l, m, o}      {f, c, a, b, m}
  300  {b, f, h, j, o, w}         {f, b}
  400  {b, c, k, s, p}            {c, b, p}
  500  {a, f, c, e, l, p, m, n}   {f, c, a, m, p}

1. Scan DB once, find frequent 1-itemsets (single item patterns)
2. Sort frequent items in frequency descending order: F-list = f-c-a-b-m-p
3. Scan DB again, construct FP-tree

Header table (item : frequency): f:4, c:4, a:3, b:3, m:3, p:3

Resulting FP-tree:
  {}
  ├── f:4
  │   ├── c:3
  │   │   └── a:3
  │   │       ├── m:2
  │   │       │   └── p:2
  │   │       └── b:1
  │   │           └── m:1
  │   └── b:1
  └── c:1
      └── b:1
          └── p:1
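A minimal construction sketch in Python (an illustration, not the authors' implementation). Ties in item frequency make the f-list order ambiguous, so the slide's F-list can be passed in explicitly:

  class FPNode:
      def __init__(self, item, parent):
          self.item, self.parent = item, parent
          self.count, self.children = 0, {}

  def build_fptree(transactions, min_sup, flist=None):
      freq = {}
      for t in transactions:               # scan 1: count items
          for i in set(t):
              freq[i] = freq.get(i, 0) + 1
      if flist is None:                    # frequency-descending order
          flist = sorted((i for i in freq if freq[i] >= min_sup),
                         key=lambda i: -freq[i])
      rank = {i: r for r, i in enumerate(flist)}
      root, header = FPNode(None, None), {i: [] for i in flist}
      for t in transactions:               # scan 2: insert ordered frequent items
          node = root
          for i in sorted((i for i in set(t) if i in rank), key=rank.get):
              if i not in node.children:
                  node.children[i] = FPNode(i, node)
                  header[i].append(node.children[i])   # node-link list
              node = node.children[i]
              node.count += 1
      return root, header, flist

  db = [list("facdgimp"), list("abcflmo"), list("bfhjow"), list("bcksp"),
        list("afcelpmn")]
  root, header, flist = build_fptree(db, 3, flist=list("fcabmp"))
  print([(n.count, n.parent.item) for n in header["m"]])   # [(2, 'a'), (1, 'b')]
  print([(n.count, n.parent.item) for n in header["p"]])   # [(2, 'm'), (1, 'b')]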
Benefits of the FP-tree Structure

■ Completeness
■ Preserve complete information for frequent pattern

mining
■ Never break a long pattern of any transaction

■ Compactness
■ Reduce irrelevant info—infrequent items are gone

■ Items in frequency descending order: the more

frequently occurring, the more likely to be shared


■ Never larger than the original database (not counting node-links and the count field)


■ For Connect-4 DB, compression ratio could be over 100



Partition Patterns and Databases

■ Frequent patterns can be partitioned into subsets


according to f-list
■ F-list=f-c-a-b-m-p

■ Patterns containing p

■ Patterns having m but no p

■ …

■ Patterns having c but none of a, b, m, p

■ Pattern f

■ Completeness and non-redundancy



Find Patterns Having p From p's Conditional Database
■ Start at the frequent-item header table of the FP-tree
■ Traverse the FP-tree by following the node-links of each frequent item p
■ Accumulate all transformed prefix paths of item p to form p's conditional pattern base

Conditional pattern bases (from the FP-tree built above):
  item  conditional pattern base
  c     f:3
  a     fc:3
  b     fca:1, f:1, c:1
  m     fca:2, fcab:1
  p     fcam:2, cb:1
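A sketch that derives the same bases without walking node-links: each transaction containing item x contributes its ordered prefix before x, and equal prefixes merge (this is exactly what the FP-tree's prefix paths store compactly):

  def conditional_pattern_base(transactions, flist, x):
      rank = {i: r for r, i in enumerate(flist)}
      base = {}
      for t in transactions:
          ordered = sorted((i for i in set(t) if i in rank), key=rank.get)
          if x in ordered:
              prefix = tuple(ordered[:ordered.index(x)])
              if prefix:
                  base[prefix] = base.get(prefix, 0) + 1
      return base

  db = [list("facdgimp"), list("abcflmo"), list("bfhjow"), list("bcksp"),
        list("afcelpmn")]
  print(conditional_pattern_base(db, list("fcabmp"), "m"))
  # {('f','c','a'): 2, ('f','c','a','b'): 1}   → m: fca:2, fcab:1
  print(conditional_pattern_base(db, list("fcabmp"), "p"))
  # {('f','c','a','m'): 2, ('c','b'): 1}       → p: fcam:2, cb:1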
From Conditional Pattern-bases to Conditional FP-trees

■ For each pattern base
  ■ Accumulate the count for each item in the base
  ■ Construct the FP-tree for the frequent items of the pattern base

m's conditional pattern base: fca:2, fcab:1
m-conditional FP-tree (b is dropped: its count 1 < min_support 3):
  {} → f:3 → c:3 → a:3

All frequent patterns relating to m:
  m, fm, cm, am, fcm, fam, cam, fcam
Recursion: Mining Each Conditional FP-tree

Starting from the m-conditional FP-tree ({} → f:3 → c:3 → a:3):
■ Conditional pattern base of "am": (fc:3) → am-conditional FP-tree {} → f:3 → c:3
■ Conditional pattern base of "cm": (f:3) → cm-conditional FP-tree {} → f:3
■ Conditional pattern base of "cam": (f:3) → cam-conditional FP-tree {} → f:3


A Special Case: Single Prefix Path in FP-tree

■ Suppose a (conditional) FP-tree T has a shared single prefix path P
■ Mining can be decomposed into two parts
  ■ Reduction of the single prefix path into one node
  ■ Concatenation of the mining results of the two parts

(Figure: a tree whose single prefix path {} → a1:n1 → a2:n2 → a3:n3 ends in a branching part r1 with children b1:m1 and C1:k1, where C1:k1 has children C2:k2 and C3:k3; T is split into the prefix-path part and the branching part r1, mined separately)
Partition-based Projection

■ Parallel projection needs a lot of disk space
■ Partition projection saves it

Tran. DB: fcamp, fcabm, fb, cbp, fcamp

  p-proj DB: fcam, cb, fcam      m-proj DB: fcab, fca, fca
  b-proj DB: f, cb, …            a-proj DB: fc, …
  c-proj DB: f, …                f-proj DB: …

  am-proj DB: fc, fc, fc         cm-proj DB: f, f, f
FP-Growth vs. Apriori: Scalability With the Support Threshold

(Figure: run time vs. support threshold on data set T25I20D10K)


FP-Growth vs. Tree-Projection: Scalability with the Support Threshold

(Figure: run time vs. support threshold on data set T25I20D100K)


Why Is FP-Growth the Winner?

■ Divide-and-conquer:
■ decompose both the mining task and DB according to
the frequent patterns obtained so far
■ leads to focused search of smaller databases
■ Other factors
■ no candidate generation, no candidate test
■ compressed database: FP-tree structure
■ no repeated scan of entire database
■ basic ops—counting local freq items and building sub
FP-tree, no pattern search and matching



Extension of Pattern Growth Mining Methodology

■ Mining closed frequent itemsets and max-patterns


■ CLOSET (DMKD’00), FPclose, and FPMax (Grahne & Zhu, Fimi’03)
■ Mining sequential patterns
■ PrefixSpan (ICDE’01), CloSpan (SDM’03), BIDE (ICDE’04)
■ Mining graph patterns
■ gSpan (ICDM’02), CloseGraph (KDD’03)
■ Constraint-based mining of frequent patterns
■ Convertible constraints (ICDE’01), gPrune (PAKDD’03)
■ Computing iceberg data cubes with complex measures
■ H-tree, H-cubing, and Star-cubing (SIGMOD’01, VLDB’03)
■ Pattern-growth-based Clustering
■ MaPle (Pei, et al., ICDM’03)
■ Pattern-Growth-Based Classification
■ Mining frequent and discriminative patterns (Cheng, et al, ICDE’07)

ECLAT: Mining by Exploring Vertical Data Format
■ Vertical format: t(AB) = {T11, T25, …}
■ tid-list: list of trans.-ids containing an itemset
■ Deriving frequent patterns based on vertical intersections
■ t(X) = t(Y): X and Y always happen together
■ t(X) ⊂ t(Y): transaction having X always has Y
■ Using diffset to accelerate mining
■ Only keep track of differences of tids
■ t(X) = {T1, T2, T3}, t(XY) = {T1, T3}
■ Diffset (XY, X) = {T2}
■ Eclat (Zaki et al. @KDD’97)
■ Mining Closed patterns using vertical format: CHARM (Zaki &
Hsiao@SDM’02)
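The diffset example above in runnable form (a minimal sketch with sets: instead of XY's full tid-list, keep only the tids in t(X) missing from t(XY), so sup(XY) = sup(X) − |diffset|):

  tX  = {"T1", "T2", "T3"}
  tXY = {"T1", "T3"}
  diffset = tX - tXY              # tids of X that do not support XY
  print(diffset)                  # {'T2'}
  print(len(tX) - len(diffset))   # sup(XY) = 3 - 1 = 2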

