Big Data - Week04 - Association Rules
[Figure: two ways to store pair counts: 4 bytes per pair vs. 12 bytes per occurring pair]
[Figure: main-memory layout for A-Priori, Pass 1 and Pass 2; Pass 2 holds the frequent items and the counts of pairs of frequent items (candidate pairs)]
◼ Trick: re-number frequent items 1,2,… and keep a
table relating new numbers to original item numbers
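A minimal sketch of the re-numbering trick, assuming the pass-1 item counts sit in a dictionary and that "frequent" means a count of at least s. The names `renumber_frequent_items`, `item_counts`, `old_to_new` and `new_to_old` are illustrative, not from the slides.

```python
def renumber_frequent_items(item_counts, s):
    """Give each frequent item a dense new number 1..m.

    Returns (old_to_new, new_to_old). Slot 0 of new_to_old is unused so
    that the new numbers start at 1, as on the slide.
    """
    frequent = sorted(item for item, count in item_counts.items() if count >= s)
    new_to_old = [None] + frequent  # table: new number -> original item number
    old_to_new = {item: new for new, item in enumerate(frequent, start=1)}
    return old_to_new, new_to_old


# Hypothetical pass-1 counts; with s = 3 only items 17 and 1004 are frequent.
item_counts = {17: 5, 230: 1, 1004: 3, 99999: 2}
old_to_new, new_to_old = renumber_frequent_items(item_counts, s=3)
```

Dense numbers let pass 2 index pair counts by small integers instead of the original, possibly sparse, item IDs.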
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 29
Frequent Triples, Etc.
◼ For each k, we construct two sets of
k-tuples (sets of size k):
▪ Ck = candidate k-tuples = those that might be frequent sets
(support ≥ s) based on information from the pass for k–1
▪ Lk = the set of truly frequent k-tuples
[Figure: A-Priori pipeline: all items → count the items → all pairs of items from L1 → count the pairs → (to be explained)]
Example
** Note: here the candidates Ck are generated from Lk-1 and L1.
But one can be more careful with candidate generation.
For example, in C3 we know {b,m,j} cannot be frequent since
{m,j} is not frequent
◼ Hypothetical steps of the A-Priori algorithm
▪ C1 = { {b} {c} {j} {m} {n} {p} }
▪ Count the support of itemsets in C1
▪ Prune non-frequent: L1 = { b, c, j, m }
▪ Generate C2 = { {b,c} {b,j} {b,m} {c,j} {c,m} {j,m} }
▪ Count the support of itemsets in C2
▪ Prune non-frequent: L2 = { {b,m} {b,c} {c,m} {c,j} }
▪ Generate C3 = { {b,c,m} {b,c,j} {b,m,j} {c,m,j} } **
▪ Count the support of itemsets in C3
▪ Prune non-frequent: L3 = { {b,c,m} }
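The two ways of generating candidates mentioned in the note can be sketched as follows, using L1 and L2 from the example. The function names `naive_candidates` and `pruned_candidates` are illustrative; the pruning rule (every (k–1)-subset of a candidate must itself be frequent) is the standard monotonicity argument, shown here as one possible implementation.

```python
from itertools import combinations


def naive_candidates(l_prev, l1):
    """Extend each frequent (k-1)-itemset by one frequent item (L_{k-1} x L_1)."""
    return {itemset | {item} for itemset in l_prev for item in l1 if item not in itemset}


def pruned_candidates(l_prev, l1):
    """Keep only the candidates whose every (k-1)-subset is frequent."""
    kept = set()
    for cand in naive_candidates(l_prev, l1):
        subsets = combinations(cand, len(cand) - 1)
        if all(frozenset(sub) in l_prev for sub in subsets):
            kept.add(cand)
    return kept


# L1 and L2 as in the example above.
L1 = {"b", "c", "j", "m"}
L2 = {frozenset(p) for p in [("b", "m"), ("b", "c"), ("c", "m"), ("c", "j")]}

C3_naive = naive_candidates(L2, L1)    # the four triples listed as C3
C3_pruned = pruned_candidates(L2, L1)  # only {b,c,m} survives
```

With pruning, {b,m,j} and {c,m,j} are dropped because {m,j} is not in L2, and {b,c,j} is dropped because {b,j} is not in L2, so nothing but {b,c,m} needs to be counted in the third pass.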
A-Priori for All Frequent Itemsets
◼ One pass for each k (itemset size)
◼ Needs room in main memory to count
each candidate k–tuple
◼ For typical market-basket data and reasonable support (e.g., 1%), k = 2
requires the most memory
◼ Many possible extensions:
▪ Association rules with intervals:
▪ For example: Men over 65 have 2 cars
▪ Association rules when items are in a taxonomy
▪ Bread, Butter → FruitJam
▪ BakedGoods, MilkProduct → PreservedGoods
▪ Lower the support s as itemset gets bigger
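The one-pass-per-k scheme can be sketched as a short in-memory loop. This is a simplified illustration, not the authors' implementation: it keeps all baskets in a list rather than streaming them from disk, and the baskets are hypothetical, chosen so that a threshold of s = 3 reproduces L1, L2 and L3 from the earlier example.

```python
from collections import Counter
from itertools import combinations


def apriori(baskets, s):
    """Return {k: set of frequent k-itemsets}, using one pass over the baskets per k."""
    baskets = [frozenset(b) for b in baskets]

    # Pass 1: count individual items.
    item_counts = Counter(item for b in baskets for item in b)
    frequent = {frozenset([i]) for i, c in item_counts.items() if c >= s}

    result = {}
    k = 1
    while frequent:
        result[k] = frequent
        k += 1
        # Only items that occur in some frequent (k-1)-itemset can matter.
        live_items = set().union(*frequent)
        counts = Counter()
        for b in baskets:  # pass k
            usable = sorted(b & live_items)
            for cand in combinations(usable, k):
                cand = frozenset(cand)
                # Count a k-tuple only if all its (k-1)-subsets are frequent.
                if all(cand - {x} in frequent for x in cand):
                    counts[cand] += 1
        frequent = {c for c, n in counts.items() if n >= s}
    return result


# Hypothetical baskets (not from the slides).
baskets = [
    {"b", "c", "m"}, {"b", "c", "m", "n"}, {"b", "c", "m"},
    {"c", "j"}, {"c", "j", "p"}, {"c", "j"},
    {"b", "j"}, {"m", "j"},
]
result = apriori(baskets, s=3)
```

The `counts` dictionary is the structure that must fit in main memory during each pass, which is why the number of candidates per k matters.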
PCY (Park-Chen-Yu) Algorithm
◼ Observation:
In pass 1 of A-Priori, most memory is idle
▪ We store only individual item counts
▪ Can we use the idle memory to reduce
memory required in pass 2?
◼ Pass 1 of PCY: In addition to item counts, maintain a hash table
with as many buckets as fit in memory
▪ Keep a count for each bucket into which
pairs of items are hashed
▪ For each bucket just keep the count, not the actual
pairs that hash to the bucket!
PCY Algorithm – First Pass
FOR (each basket) :
    FOR (each item in the basket) :
        add 1 to item’s count;
    FOR (each pair of items) :                 (new in PCY)
        hash the pair to a bucket;             (new in PCY)
        add 1 to the count for that bucket;    (new in PCY)
◼ Few things to note:
▪ Pairs of items need to be generated from the input file; they
are not present in the file
▪ We are not just interested in the presence of a pair, but we
need to see whether it is present at least s (support) times
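A runnable sketch of the first pass, assuming integer item IDs and a small hash table. The hash function `bucket_of`, the bucket count of 5 and the baskets are all hypothetical choices made for illustration; a real run would use as many buckets as fit in memory.

```python
from collections import Counter
from itertools import combinations


def bucket_of(pair, num_buckets):
    """Hypothetical hash for a pair (i, j) of integer item IDs with i < j."""
    i, j = pair
    return (i * 31 + j) % num_buckets


def pcy_pass1(baskets, num_buckets):
    """Count items, and count how many pair occurrences hash to each bucket."""
    item_counts = Counter()
    bucket_counts = [0] * num_buckets  # counts only; the pairs are not stored
    for basket in baskets:
        items = sorted(set(basket))
        for item in items:
            item_counts[item] += 1
        # New in PCY: pairs are generated from the basket, then hashed.
        for pair in combinations(items, 2):
            bucket_counts[bucket_of(pair, num_buckets)] += 1
    return item_counts, bucket_counts


# Hypothetical input.
baskets = [[1, 2, 3], [1, 2], [2, 3], [1, 2, 4]]
item_counts, bucket_counts = pcy_pass1(baskets, num_buckets=5)
```

Sorting each basket before generating pairs ensures that a pair is always hashed in the same order, so (1, 2) and (2, 1) never land in different buckets.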
Observations about Buckets
◼ Observation: If a bucket contains a frequent pair, then the bucket is
surely frequent
◼ However, even without any frequent pair,
a bucket can still be frequent ☹
▪ So, we cannot use the hash to eliminate any
member (pair) of a “frequent” bucket
◼ But, for a bucket with total count less than s,
none of its pairs can be frequent ☺
▪ Pairs that hash to this bucket can be eliminated as candidates (even if the pair
consists of 2 frequent items)
◼ Pass 2:
Only count pairs that hash to frequent buckets
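The bucket observations can be checked on a toy input. This sketch uses a hypothetical hash function, five buckets and a support threshold of 3; the exact pair counts are kept only to verify the claims and would not be stored in a real PCY run.

```python
from collections import Counter
from itertools import combinations

NUM_BUCKETS = 5  # hypothetical table size
S = 3            # hypothetical support threshold


def bucket_of(pair):
    """Hypothetical hash for a pair (i, j) of integer item IDs with i < j."""
    i, j = pair
    return (i * 31 + j) % NUM_BUCKETS


baskets = [[1, 2, 3], [1, 2], [2, 3], [1, 2, 4]]

pair_counts = Counter()    # exact counts, for checking only
bucket_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(set(basket)), 2):
        pair_counts[pair] += 1
        bucket_counts[bucket_of(pair)] += 1

frequent_pairs = {p for p, c in pair_counts.items() if c >= S}
frequent_buckets = {b for b, c in bucket_counts.items() if c >= S}

# A frequent bucket may contain no frequent pair at all.
false_positive_buckets = {
    b for b in frequent_buckets
    if not any(bucket_of(p) == b for p in frequent_pairs)
}
# Pairs hashing to an infrequent bucket can be dropped as candidates.
eliminated = {p for p in pair_counts if bucket_of(p) not in frequent_buckets}
```

Here bucket 0 reaches a count of 3 from the pairs (2, 3) and (1, 4), neither of which is frequent on its own, while the pairs (1, 3) and (2, 4) fall into infrequent buckets and are eliminated.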
PCY Algorithm – Between Passes
◼ Replace the buckets by a bit-vector:
▪ 1 means the bucket count reached the support threshold s (count ≥ s;
call it a frequent bucket); 0 means it did not
[Figure: main-memory layout for PCY. Pass 1: item counts and the hash table for pairs. Pass 2: frequent items, the bitmap, and counts of candidate pairs]
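A sketch of the step between passes and of the second pass, assuming integer item IDs. The bitmap is packed into a single Python integer for brevity; `bucket_of`, the bucket count of 5, the baskets and the threshold s = 2 are hypothetical, and the hard-coded `item_counts` and `bucket_counts` are what a first pass over these baskets yields under this hash.

```python
from collections import Counter
from itertools import combinations


def bucket_of(pair, num_buckets):
    """Hypothetical hash for a pair (i, j) of integer item IDs with i < j."""
    i, j = pair
    return (i * 31 + j) % num_buckets


def to_bitmap(bucket_counts, s):
    """One bit per bucket: bit b is 1 if and only if bucket b is frequent."""
    bitmap = 0
    for b, count in enumerate(bucket_counts):
        if count >= s:
            bitmap |= 1 << b
    return bitmap


def pcy_pass2(baskets, frequent_items, bitmap, num_buckets, s):
    """Count only pairs of frequent items that hash to a frequent bucket."""
    counts = Counter()
    for basket in baskets:
        items = sorted(i for i in set(basket) if i in frequent_items)
        for pair in combinations(items, 2):
            if (bitmap >> bucket_of(pair, num_buckets)) & 1:
                counts[pair] += 1
    return {p for p, c in counts.items() if c >= s}


# Hypothetical data and its pass-1 output.
baskets = [[1, 2, 3], [1, 2], [2, 3], [1, 2, 4]]
item_counts = {1: 3, 2: 4, 3: 2, 4: 1}
bucket_counts = [3, 1, 0, 3, 1]

s = 2
frequent_items = {i for i, c in item_counts.items() if c >= s}
bitmap = to_bitmap(bucket_counts, s)
frequent_pairs = pcy_pass2(baskets, frequent_items, bitmap, 5, s)
```

In this run the pair (1, 3) is never counted even though both of its items are frequent, because it hashes to bucket 4, whose bit is 0.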
Frequent Itemsets in < 2 Passes
◼ A-Priori, PCY, etc., take k passes to find frequent
itemsets of size k
◼ Can we use fewer passes?
◼ Use 2 or fewer passes for all sizes,
but may miss some frequent itemsets
▪ Random sampling (often a cure for having too much data)
▪ SON (Savasere, Omiecinski, and Navathe)
▪ So we don’t pay for disk I/O each time we increase the size of itemsets
▪ Reduce support threshold proportionally to match the sample size
[Figure: main memory holding the baskets and space for counts]
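A small sketch of the sampling step, assuming each basket is kept independently with a fixed probability. The names `sample_baskets` and `scaled_threshold`, the seed, and the numbers in the example are illustrative assumptions, not from the slides.

```python
import random


def sample_baskets(baskets, fraction, seed=0):
    """Keep each basket independently with probability `fraction`."""
    rng = random.Random(seed)
    return [b for b in baskets if rng.random() < fraction]


def scaled_threshold(s, fraction):
    """Scale the full-data support threshold down to the sample size."""
    return s * fraction


# Hypothetical setting: threshold 1000 on the full data, 25% sample.
baskets = [[1, 2], [2, 3], [1, 3], [1, 2, 3]]
sample = sample_baskets(baskets, fraction=0.25)
sample_s = scaled_threshold(1000, 0.25)  # threshold to use on the sample
```

The sample and its counts can then be mined entirely in main memory with the scaled threshold, at the cost of possibly missing itemsets that are frequent in the full data but not in the sample.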