LECTURE 2
Frequent Itemsets, Association Rules
Outline
• Market-Basket Data
• Frequent Itemsets
• Applications
• Mining Frequent Itemsets
• Itemset lattice
• A Naïve Algorithm
• The Apriori Principle
• The Apriori algorithm
• Examples
• Hash tree
• Association Rule Mining
• http://www.philippe-fournier-viger.com/spmf/Apriori.php
• http://www.philippe-fournier-viger.com/spmf/AprioriTID.php
Market-Basket Data
• A large set of items, e.g., things sold in a supermarket.
• A large set of baskets, each of which is a small set of the items, e.g., the things one customer buys on one day.
Market-Baskets – (2)
• Really, a general many-to-many mapping (association) between two kinds of things, where each basket is a set of items.
• But we ask about connections among “items,” not “baskets.”
• The technology focuses on common events, not on rare events (the “long tail”).
Frequent Itemsets
• Given a set of transactions, find combinations of items (itemsets) that occur frequently.
• Support s(I): the number of transactions that contain itemset I.

Market-Basket transactions
Items: {Bread, Milk, Diaper, Beer, Eggs, Coke}

TID | Items
 1  | Bread, Milk
 2  | Bread, Diaper, Beer, Eggs
 3  | Milk, Diaper, Beer, Coke
 4  | Bread, Milk, Diaper, Beer
 5  | Bread, Milk, Diaper, Coke

Examples of frequent itemsets, s(I) ≥ 3:
{Bread}: 4, {Milk}: 4, {Diaper}: 4, {Beer}: 3, {Diaper, Beer}: 3, {Milk, Bread}: 3
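
Support can be computed by a direct scan of the transactions. A minimal Python sketch (the helper name and the set representation are ours, not from the slides):

def support(itemset, transactions):
    # Number of transactions that contain every item of `itemset`.
    return sum(1 for t in transactions if itemset <= t)

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
print(support({"Diaper", "Beer"}, transactions))  # 3, frequent under s(I) >= 3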
Applications – (1)
• Items = products; baskets = sets of products someone bought in one trip to the store.
Applications – (2)
• Baskets = Web pages; items = words.
Applications – (3)
• Baskets = sentences; items = documents containing those sentences.
• Problem parameters:
• N = |T|: the number of transactions
• d = |I|: the number of (distinct) items
• w: the maximum width of a transaction
• Number of possible itemsets? M = 2^d
Illustration of the Apriori principle
• Apriori principle: if an itemset is frequent, then all of its subsets are frequent; equivalently, if an itemset is infrequent, then all of its supersets are infrequent.
[Figure: the itemset lattice over items {A, B, C, D, E}, from single items down to ABCDE. In one panel some itemsets are marked “found to be frequent”; in the other, an itemset is “found to be infrequent,” so all of its supersets, up to ABCDE, are infrequent and are pruned from the search.]
The Apriori algorithm
Level-wise approach. Ck = candidate itemsets of size k; Lk = frequent itemsets of size k.
1. k = 1, C1 = all items
2. While Ck is not empty:
3.   [Frequent itemset generation] Scan the database to find which itemsets in Ck are frequent and put them into Lk
4.   [Candidate generation] Use Lk to generate a collection Ck+1 of candidate itemsets of size k+1
5.   k = k + 1

R. Agrawal, R. Srikant: "Fast Algorithms for Mining Association Rules", Proc. of the 20th Int'l Conference on Very Large Data Bases, 1994.
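
A runnable sketch of this level-wise loop, assuming transactions are sets and itemsets are kept as sorted tuples (function and variable names are ours; the join-and-prune in step 4 anticipates the candidate-generation slides below):

from itertools import combinations

def apriori(transactions, minsup):
    transactions = [set(t) for t in transactions]
    items = sorted({i for t in transactions for i in t})
    candidates = [(i,) for i in items]   # step 1: C1 = all items
    frequent = {}                        # itemset tuple -> support count
    levels = []                          # L1, L2, ... as sorted lists
    while candidates:                    # step 2
        # Step 3: scan the database; the frequent candidates form Lk.
        counts = {c: sum(1 for t in transactions if set(c) <= t)
                  for c in candidates}
        Lk = sorted(c for c, n in counts.items() if n >= minsup)
        frequent.update((c, counts[c]) for c in Lk)
        if Lk:
            levels.append(Lk)
        # Step 4: join itemsets sharing their first k-1 items, then drop
        # candidates that have an infrequent k-subset (Apriori principle).
        Lk_set = set(Lk)
        candidates = []
        for a, b in combinations(Lk, 2):
            if a[:-1] == b[:-1]:                 # same (k-1)-item prefix
                cand = a + (b[-1],)
                if all(s in Lk_set for s in combinations(cand, len(a))):
                    candidates.append(cand)
    return frequent, levels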
Illustration of the Apriori principle
minsup = 3

TID | Items
 1  | Bread, Milk
 2  | Bread, Diaper, Beer, Eggs
 3  | Milk, Diaper, Beer, Coke
 4  | Bread, Milk, Diaper, Beer
 5  | Bread, Milk, Diaper, Coke

Items (1-itemsets):
Item   | Count
Bread  | 4
Coke   | 2
Milk   | 4
Beer   | 3
Diaper | 4
Eggs   | 1

Pairs (2-itemsets) — no need to generate candidates involving Coke or Eggs:
Itemset         | Count
{Bread, Milk}   | 3
{Bread, Beer}   | 2
{Bread, Diaper} | 3
{Milk, Beer}    | 2
{Milk, Diaper}  | 3
{Beer, Diaper}  | 3

Triplets (3-itemsets):
Itemset               | Count
{Bread, Milk, Diaper} | 2
Only this triplet has all of its subsets frequent, but it is below the minsup threshold.

If every subset is considered: C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41 candidates.
With support-based pruning: C(6,1) + C(4,2) + 1 = 6 + 6 + 1 = 13 candidates.
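
Running the sketch above on these five transactions reproduces the counts on this slide:

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
frequent, levels = apriori(transactions, minsup=3)
print(levels[0])   # [('Beer',), ('Bread',), ('Diaper',), ('Milk',)]
print(levels[1])   # [('Beer', 'Diaper'), ('Bread', 'Diaper'),
                   #  ('Bread', 'Milk'), ('Diaper', 'Milk')]
print(len(levels)) # 2: {Bread, Milk, Diaper} is generated, but its support 2 < 3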
Candidate Generation
• Basic principle (Apriori):
• An itemset of size k+1 is a candidate to be frequent only if all of its subsets of size k are known to be frequent.
• Main idea:
• Construct a candidate of size k+1 by combining frequent itemsets of size k.
• If k = 1, take all pairs of frequent items.
• If k > 1, join pairs of itemsets that differ by just one item.
• For each generated candidate itemset, ensure that all of its subsets of size k are frequent.
Generate Candidates Ck+1
• Assumption: the items in an itemset are ordered
• e.g., integers in increasing order, strings in lexicographic order
• The items in Lk are also listed in an order
• Self-join Lk:
insert into Ck+1
select p.item1, p.item2, …, p.itemk, q.itemk
from Lk p, Lk q
where p.item1 = q.item1, …, p.itemk-1 = q.itemk-1, p.itemk < q.itemk
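
A direct Python transcription of this self-join, under the same ordered-items assumption (itemsets as sorted tuples; the function name is ours):

def self_join(Lk):
    k = len(Lk[0])
    Ck1 = []
    for p in Lk:
        for q in Lk:
            # p.item1 = q.item1, ..., p.item(k-1) = q.item(k-1), p.itemk < q.itemk
            if p[:k - 1] == q[:k - 1] and p[k - 1] < q[k - 1]:
                Ck1.append(p + (q[-1],))
    return Ck1

# self_join([('a','b','c'), ('a','b','d'), ('a','c','d'), ('a','c','e'),
#            ('b','c','d')]) returns [('a','b','c','d'), ('a','c','d','e')]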
Example I
• L3={abc, abd, acd, ace, bcd}
• Self-join: L3*L3
– abcd from abc and abd
– acde from acd and ace
Apriori principle
• Pruning step:
• For each candidate (k+1)-itemset, create all of its subset k-itemsets.
• Remove a candidate if it contains a subset k-itemset that is not frequent.
Example I
• L3 = {abc, abd, acd, ace, bcd}
• Self-joining: L3*L3
– abcd from abc and abd
– acde from acd and ace
• Pruning:
– abcd is kept since all of its subset itemsets (abc, abd, acd, bcd) are in L3
– acde is removed since ade (and cde) are not in L3
[Figure: joining {a,b,c} with {a,b,d} gives {a,b,c,d}; joining {a,c,d} with {a,c,e} gives {a,c,d,e}.]
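
The pruning step in the same style (a sketch; `prune` is our name):

from itertools import combinations

def prune(Ck1, Lk):
    # Keep a candidate only if all of its k-subsets are frequent.
    Lk_set = set(Lk)
    return [c for c in Ck1
            if all(s in Lk_set for s in combinations(c, len(c) - 1))]

L3 = [('a','b','c'), ('a','b','d'), ('a','c','d'), ('a','c','e'), ('b','c','d')]
print(prune([('a','b','c','d'), ('a','c','d','e')], L3))
# [('a', 'b', 'c', 'd')] -- acde is dropped because ade and cde are not in L3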
[Figure: enumerating the 3-subsets of transaction {1, 2, 3, 5, 6} recursively: fix the first item (1+ 2356, 2+ 356, 3+ 56) and recurse on the remaining items. The 15 candidate 3-itemsets 234, 567, 145, 136, 345, 356, 367, 357, 368, 124, 159, 689, 125, 457, 458 are stored in a hash tree whose nodes branch on the hash function h: items 1,4,7 / 2,5,8 / 3,6,9.]
Subset Operation Using Hash Tree
[Figure: matching transaction {1, 2, 3, 5, 6} against the hash tree. The transaction's items are hashed at each node (1+ 2356, 2+ 356, 3+ 56, then 12+ 356, 13+ 56, 15+ 6, …) to descend to the leaves, where the counters of the contained candidates are incremented. The transaction is matched against only 9 of the 15 candidates.]
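
In code, the counting step answers the same question without the tree: enumerate the k-subsets of the transaction and look each one up in a hash table of candidates. A Python set stands in for the hash tree here (a deliberate simplification; the tree additionally limits how many candidates each subset is compared against):

from itertools import combinations

def count_in_transaction(transaction, candidates, counts, k):
    # Increment the counter of every candidate k-itemset that is
    # contained in the transaction.
    for subset in combinations(sorted(transaction), k):
        if subset in candidates:
            counts[subset] = counts.get(subset, 0) + 1

candidates = {(2,3,4), (5,6,7), (1,4,5), (1,3,6), (3,4,5), (3,5,6), (3,6,7),
              (3,5,7), (3,6,8), (1,2,4), (1,5,9), (6,8,9), (1,2,5), (4,5,7),
              (4,5,8)}
counts = {}
count_in_transaction({1, 2, 3, 5, 6}, candidates, counts, 3)
print(counts)  # {(1, 2, 5): 1, (1, 3, 6): 1, (3, 5, 6): 1}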
Picture of Apriori
[Figure: main-memory layout of the two passes. Pass 1 counts individual items and yields the frequent items; Pass 2 counts only pairs of frequent items and yields the frequent pairs. Counting pairs takes 4 bytes per pair in a triangular matrix, or 12 bytes per occurring pair as triples.]
Triangular-Matrix Approach
• Number items 1, 2, …, n.
• Requires a table of size O(n) to convert item names to consecutive integers.
• Count {i, j} only if i < j.
• Keep pairs in the order {1,2}, {1,3}, …, {1,n}, {2,3}, {2,4}, …, {2,n}, {3,4}, …, {3,n}, …, {n-1,n}.
• Find pair {i, j} at position (i – 1)(n – i/2) + j – i.
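
A sketch of the layout in Python, with the slide's position formula rewritten in integer arithmetic ((i-1)(n - i/2) = (i-1)(2n - i)/2):

n = 5
counts = [0] * (n * (n - 1) // 2)   # one counter per pair {i, j}, i < j

def pair_index(i, j):
    # 1-based position of pair {i, j}: (i-1)(n - i/2) + j - i
    return (i - 1) * (2 * n - i) // 2 + j - i

def count_pair(i, j):
    counts[pair_index(i, j) - 1] += 1   # convert to a 0-based list index

print(pair_index(1, 2), pair_index(n - 1, n))   # 1 and n(n-1)/2 = 10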
Details of Approach #2
• Store the pair counts as triples [i, j, c], meaning the pair {i, j} occurs in c baskets; this takes the 12 bytes per occurring pair noted above, so it beats the 4-bytes-per-pair triangular matrix only when significantly fewer than 1/3 of all possible pairs actually occur.
ASSOCIATION RULES
Association Rule Mining
• Given a set of transactions, find rules that predict the occurrence of an itemset based on the occurrence of another itemset in the transaction.

Market-Basket transactions
TID | Items
 1  | Bread, Milk
 2  | Bread, Diaper, Beer, Eggs
 3  | Milk, Diaper, Beer, Coke
 4  | Bread, Milk, Diaper, Beer
 5  | Bread, Milk, Diaper, Coke

Examples of Association Rules:
{Diaper} => {Beer}
{Milk, Bread} => {Eggs, Coke}
{Beer, Bread} => {Milk}

Implication means co-occurrence, not causality!
Definition: Association Rule
• Association Rule: an implication expression of the form X => Y, where X and Y are itemsets.
• Example: {Milk, Diaper} => {Beer}

Rule Evaluation Metrics:
• Support (s): the fraction of transactions that contain both X and Y; the probability P(X, Y) that X and Y occur together.
• Confidence (c): how often items in Y appear in transactions that contain X; the conditional probability P(Y|X) that Y occurs given that X has occurred.

Example, for {Milk, Diaper} => {Beer} over the same five transactions:
s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4
c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 ≈ 0.67
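
Both metrics computed directly from these definitions (the helper names are ours; transactions as in the table):

def sigma(itemset, transactions):
    # Number of transactions containing the itemset.
    return sum(1 for t in transactions if itemset <= t)

def rule_metrics(X, Y, transactions):
    s = sigma(X | Y, transactions) / len(transactions)       # P(X, Y)
    c = sigma(X | Y, transactions) / sigma(X, transactions)  # P(Y | X)
    return s, c

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
print(rule_metrics({"Milk", "Diaper"}, {"Beer"}, transactions))  # (0.4, 0.666...)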
Association Rule Mining Task
• Input: A set of transactions T, over a set of items I
• Output: All rules with items in I having
• support ≥ minsup threshold
• confidence ≥ minconf threshold
Mining Association Rules
• Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup
2. Rule Generation
– Generate high-confidence rules from each frequent itemset, where each rule is a partitioning of a frequent itemset into a Left-Hand Side (LHS) and a Right-Hand Side (RHS)
Lattice of rules created by the RHS
[Figure: the lattice of rules generated from the frequent itemset {A, B, C, D}, from rules with a two-item RHS (CD=>AB, BD=>AC, BC=>AD, AD=>BC, AC=>BD, AB=>CD) down to rules with a three-item RHS (D=>ABC, C=>ABD, B=>ACD, A=>BCD); the low-confidence rules and everything below them are pruned.]
• For rules generated from the same frequent itemset, confidence can only drop as items move from the LHS to the RHS: if the rule X => Y−X is below the confidence threshold, then so is X′ => Y−X′ for every X′ ⊂ X, so the whole sub-lattice below a failing rule can be pruned.
Rule Generation for Apriori Algorithm
• A candidate rule is generated by merging two rules that share the same prefix in the RHS.
• Example: join(CD => AB, BD => AC) would produce the candidate rule D => ABC.
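
A hedged sketch of the whole rule-generation step for one frequent itemset: single-item RHSs first, low-confidence rules discarded, and surviving RHSs that share a prefix merged exactly as in the join above. `frequent` is assumed to map sorted itemset tuples to support counts, as in the earlier Apriori sketch:

from itertools import combinations

def gen_rules(itemset, frequent, minconf):
    rules = []
    rhs_level = [(i,) for i in itemset]   # start with one-item RHSs
    while rhs_level:
        kept = []
        for rhs in rhs_level:
            lhs = tuple(i for i in itemset if i not in rhs)
            if not lhs:
                continue
            conf = frequent[itemset] / frequent[lhs]   # P(RHS | LHS)
            if conf >= minconf:
                rules.append((lhs, rhs, conf))
                kept.append(rhs)
        # Merge surviving RHSs sharing their prefix, e.g. the RHSs AB and AC
        # of CD => AB and BD => AC merge into ABC, giving D => ABC.
        rhs_level = [a + (b[-1],) for a, b in combinations(kept, 2)
                     if a[:-1] == b[:-1]]
    return rules

# e.g. gen_rules(('Beer', 'Diaper'), frequent, 0.6) with the earlier counts
# returns [(('Diaper',), ('Beer',), 0.75), (('Beer',), ('Diaper',), 1.0)]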