Big Data - Week04 - Association Rules

The document discusses frequent itemset mining and association rule discovery, particularly in the context of supermarket sales data to identify items frequently bought together. It outlines the market-basket model, the importance of support and confidence in association rules, and various applications including retail and healthcare. Additionally, it describes algorithms for finding frequent itemsets and the challenges associated with processing large datasets efficiently.


Frequent Itemset Mining & Association Rules

Slides from: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org


Association Rule Discovery
Supermarket shelf management – the market-basket model:
◼ Goal: Identify items that are bought together by sufficiently many customers
◼ Approach: Process the sales data collected with barcode scanners to find dependencies among items
◼ A classic rule:
▪ If someone buys diapers and milk, then they are likely to buy beer
▪ Don’t be surprised if you find six-packs next to diapers!
The Market-Basket Model
◼ Input: A large set of items
▪ e.g., things sold in a supermarket
◼ Input: A large set of baskets, where each basket is a small subset of items
▪ e.g., the things one customer buys on one day
◼ Output: We want to discover association rules
▪ People who bought {x,y,z} tend to buy {v,w}
▪ Amazon!
◼ Example rules discovered: {Milk} --> {Coke}; {Diaper, Milk} --> {Beer}
Applications – (1)
◼ Items = products; Baskets = sets of products someone bought in one trip to the store
◼ Real market baskets: Chain stores keep TBs of data about what customers buy together
▪ Tells how typical customers navigate stores, lets them position tempting items
▪ Suggests tie-in “tricks”, e.g., run a sale on diapers and raise the price of beer
▪ Need the rule to occur frequently, or no $$’s
◼ Amazon’s “people who bought X also bought Y”
Applications – (2)
◼ Baskets = sentences; Items = documents containing those sentences
▪ Items that appear together too often could represent plagiarism
▪ Notice that items do not have to be “in” baskets

◼ Baskets = patients; Items = drugs & side-effects
▪ Has been used to detect combinations of drugs that result in particular side-effects
▪ But requires an extension: absence of an item needs to be observed as well as presence
Outline
First: Define
▪ Frequent itemsets
▪ Association rules: confidence, support, interestingness
Then: Algorithms for finding frequent itemsets
▪ Finding frequent pairs
▪ A-Priori algorithm
▪ PCY algorithm + 2 refinements
Frequent Itemsets
◼ Simplest question: Find sets of items that appear together “frequently” in baskets
◼ Support for itemset I: Number of baskets containing all items in I
▪ (Often expressed as a fraction of the total number of baskets)
◼ Given a support threshold s, the sets of items that appear in at least s baskets are called frequent itemsets
▪ Example (from the slide’s figure): support of {Beer, Bread} = 2
Example: Frequent Itemsets
◼ Items = {milk, coke, pepsi, beer, juice}
◼ Support threshold = 3 baskets
B1 = {m, c, b}      B2 = {m, p, j}
B3 = {m, b}         B4 = {c, j}
B5 = {m, p, b}      B6 = {m, c, b, j}
B7 = {c, b, j}      B8 = {b, c}

◼ Frequent itemsets: {m}, {c}, {b}, {j}, {m,b}, {b,c}, {c,j}
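To make the definition concrete, here is a minimal Python sketch of the support computation for this example. It is not from the slides; the `baskets`, `s`, and `support` names are our own:

```python
from itertools import combinations

# Baskets from the example above (m = milk, c = coke, p = pepsi, b = beer, j = juice)
baskets = [
    {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
]
s = 3  # support threshold

def support(itemset):
    """Number of baskets containing all items of the itemset."""
    return sum(set(itemset) <= basket for basket in baskets)

items = sorted(set().union(*baskets))
frequent_1 = [{i} for i in items if support({i}) >= s]                    # {b} {c} {j} {m}
frequent_2 = [set(p) for p in combinations(items, 2) if support(p) >= s]  # {b,c} {b,m} {c,j}
```

Running this reproduces the slide’s answer: the frequent singletons are {m}, {c}, {b}, {j} and the frequent pairs are {m,b}, {b,c}, {c,j}.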
Association Rules
◼ Association rules: if-then rules about the contents of baskets
◼ {i1, i2, …, ik} → j means: “if a basket contains all of i1, …, ik then it is likely to contain j”
◼ In practice there are many rules; we want to find the significant/interesting ones!
◼ Confidence of this association rule is the probability of j given I = {i1, …, ik}:
conf(I → j) = support(I ∪ {j}) / support(I)


Interesting Association Rules
◼ Not all high-confidence rules are interesting
▪ The rule X → milk may have high confidence for many itemsets X, because milk is simply purchased very often (independently of X), so the confidence will be high
◼ Interest of an association rule I → j: the difference between its confidence and the fraction of baskets that contain j:
Interest(I → j) = conf(I → j) − Pr[j]
▪ Interesting rules are those with high positive or negative interest values (usually above 0.5 in absolute value)


Example: Confidence and Interest
B1 = {m, c, b}      B2 = {m, p, j}
B3 = {m, b}         B4 = {c, j}
B5 = {m, p, b}      B6 = {m, c, b, j}
B7 = {c, b, j}      B8 = {b, c}

◼ Association rule: {m, b} → c
▪ Confidence = 2/4 = 0.5
▪ Interest = |0.5 – 5/8| = 1/8
▪ Item c appears in 5/8 of the baskets
▪ The rule is not very interesting!
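Reusing the `baskets` and `support` helper from the earlier sketch, confidence and interest can be computed directly (the function names are our own, not from the slides):

```python
def confidence(I, j):
    """conf(I -> j) = support(I ∪ {j}) / support(I)."""
    return support(set(I) | {j}) / support(I)

def interest(I, j):
    """Confidence minus the fraction of baskets that contain j."""
    return confidence(I, j) - support({j}) / len(baskets)

print(confidence({"m", "b"}, "c"))  # 2/4 = 0.5
print(interest({"m", "b"}, "c"))    # 0.5 - 5/8 = -0.125, i.e., |interest| = 1/8
```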
Finding Association Rules
◼ Problem: Find all association rules with support ≥ s and confidence ≥ c
▪ Note: The support of an association rule is the support of the set of items on both sides
◼ Hard part: Finding the frequent itemsets!
▪ If {i1, i2, …, ik} → j has high support and confidence, then both {i1, i2, …, ik} and {i1, i2, …, ik, j} will be “frequent”
Mining Association Rules
◼ Step 1: Find all frequent itemsets I
▪ (we will explain this next)
◼ Step 2: Rule generation (see the sketch below)
▪ For every subset A of I, generate a rule A → I \ A
▪ Since I is frequent, A is also frequent
▪ Variant 1: Single pass to compute the rule confidence
▪ confidence(A,B → C,D) = support(A,B,C,D) / support(A,B)
▪ Variant 2:
▪ Observation: If A,B,C → D is below confidence, so is A,B → C,D
▪ Can generate “bigger” rules from smaller ones!
▪ Output the rules above the confidence threshold
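A sketch of Step 2, again reusing the `support` helper defined earlier. The `rules` name is our own, and a production implementation would cache the itemset supports counted in Step 1 rather than recounting them:

```python
from itertools import combinations

def rules(I, min_conf):
    """Yield every rule A -> I \\ A from the frequent itemset I whose
    confidence support(I) / support(A) meets min_conf."""
    I = frozenset(I)
    for r in range(1, len(I)):
        for A in map(frozenset, combinations(I, r)):
            conf = support(I) / support(A)  # A ∪ (I \ A) = I
            if conf >= min_conf:
                yield set(A), set(I - A), conf

for A, B, conf in rules({"m", "c", "b"}, 0.75):
    print(A, "->", B, round(conf, 2))
```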
Example
B1 = {m, c, b}      B2 = {m, p, j}
B3 = {m, c, b, n}   B4 = {c, j}
B5 = {m, p, b}      B6 = {m, c, b, j}
B7 = {c, b, j}      B8 = {b, c}

◼ Support threshold s = 3, confidence threshold c = 0.75
◼ 1) Frequent itemsets:
▪ {b,m} {b,c} {c,m} {c,j} {m,c,b}
◼ 2) Generate rules:
▪ b → m: c = 4/6      b → c: c = 5/6      b,c → m: c = 3/5
▪ m → b: c = 4/5      …                   b,m → c: c = 3/4
Finding Frequent Itemsets
Itemsets: Computation Model
◼ Back to finding frequent itemsets
◼ Typically, data is kept in flat files rather than in a database system:
▪ Stored on disk
▪ Stored basket-by-basket
▪ Baskets are small, but we have many baskets and many items
▪ Expand baskets into pairs, triples, etc. as you read baskets
▪ Use k nested loops to generate all sets of size k
[Figure: a flat file of baskets; items are positive integers, and boundaries between baskets are –1]
Note: We want to find frequent itemsets. To find them, we have to count them. To count them, we have to generate them.
Computation Model
◼ The true cost of mining disk-resident data is usually the number of disk I/Os
◼ In practice, association-rule algorithms read the data in passes – all baskets are read in turn
◼ We measure the cost by the number of passes an algorithm makes over the data
Main-Memory Bottleneck
◼ For many frequent-itemset algorithms, main memory is the critical resource
▪ As we read baskets, we need to count something, e.g., occurrences of pairs of items
▪ The number of different things we can count is limited by main memory
▪ Swapping counts in/out is a disaster (why?)


Finding Frequent Pairs
◼ The hardest problem often turns out to be finding the frequent pairs of items {i1, i2}
▪ Why? Frequent pairs are common, frequent triples are rare
▪ Why? The probability of being frequent drops exponentially with size; the number of sets grows more slowly with size
◼ Let’s first concentrate on pairs, then extend to larger sets
◼ The approach:
▪ We always need to generate all the itemsets
▪ But we would only like to count (keep track of) those itemsets that in the end turn out to be frequent
Naïve Algorithm
◼ Naïve approach to finding frequent pairs
◼ Read the file once, counting in main memory the occurrences of each pair (see the sketch below):
▪ From each basket of n items, generate its n(n–1)/2 pairs by two nested loops
◼ Fails if (#items)^2 exceeds main memory
▪ Remember: #items can be 100K (Wal-Mart) or 10B (Web pages)
▪ Suppose 10^5 items and counts are 4-byte integers
▪ Number of pairs of items: 10^5(10^5 – 1)/2 ≈ 5·10^9
▪ Therefore, 2·10^10 bytes (20 gigabytes) of memory needed
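For reference, the naïve counting can be sketched in a few lines of Python (building on the `baskets` and `s` names from the earlier sketch; `pair_counts` is our own). It is exactly this in-memory table that blows up when (#items)^2 is large:

```python
from collections import defaultdict
from itertools import combinations

pair_counts = defaultdict(int)
for basket in baskets:
    # All n(n-1)/2 pairs of an n-item basket, i.e., the two nested loops
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

frequent_pairs = {p: c for p, c in pair_counts.items() if c >= s}
```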
Counting Pairs in Memory
Two approaches:
◼ Approach 1: Count all pairs using a matrix
◼ Approach 2: Keep a table of triples [i, j, c] = “the count of the pair of items {i, j} is c”
▪ If integers and item ids are 4 bytes, we need approximately 12 bytes for pairs with count > 0
▪ Plus some additional overhead for the hash table
Note:
◼ Approach 1 only requires 4 bytes per pair
◼ Approach 2 uses 12 bytes per pair (but only for pairs with count > 0)
Comparing the 2 Approaches
[Figure: the triangular matrix uses 4 bytes per pair; the table of triples uses 12 bytes per occurring pair]


Comparing the two approaches
◼ Approach 1: Triangular Matrix
▪ n = total number of items
▪ Count pair of items {i, j} only if i < j
▪ Keep pair counts in lexicographic order:
▪ {1,2}, {1,3}, …, {1,n}, {2,3}, {2,4}, …, {2,n}, {3,4}, …
▪ Pair {i, j} is at position (i – 1)(n – i/2) + j – i
▪ Total number of pairs n(n – 1)/2; total bytes = 2n^2
▪ The triangular matrix requires 4 bytes per pair
◼ Approach 2 uses 12 bytes per occurring pair (but only for pairs with count > 0)
▪ Beats Approach 1 if less than 1/3 of possible pairs actually occur
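A small sketch of the indexing formula in integer arithmetic (the function name is our own):

```python
def triangular_index(i, j, n):
    """1-based position of pair {i, j} (1 <= i < j <= n) in the lexicographic
    layout {1,2}, {1,3}, ..., {1,n}, {2,3}, ...
    Integer form of (i - 1)(n - i/2) + j - i."""
    assert 1 <= i < j <= n
    return (i - 1) * (2 * n - i) // 2 + (j - i)

# e.g., with n = 5: {1,2} -> 1, {1,5} -> 4, {2,3} -> 5, {4,5} -> 10
```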
Comparing the two approaches (continued)
◼ Same comparison as on the previous slide, with the key takeaway:
◼ The problem with the triangular matrix is that if we have too many items, the pair counts do not fit into memory
◼ Can we do better?
A-Priori Algorithm
A-Priori Algorithm – (1)
◼ A two-pass approach called A-Priori limits the need for main memory
◼ Key idea: monotonicity
▪ If a set of items I appears at least s times, so does every subset J of I
◼ Contrapositive for pairs: if item i does not appear in s baskets, then no pair including i can appear in s baskets
◼ So, how does A-Priori find frequent pairs?
A-Priori Algorithm – (2)
◼ Pass 1: Read baskets and count in main memory the occurrences of each individual item
▪ Requires only memory proportional to #items
◼ Items that appear ≥ s times are the frequent items
◼ Pass 2: Read baskets again and count in main memory only those pairs where both elements are frequent (from Pass 1)
▪ Requires memory proportional to the square of the number of frequent items (for the counts)
▪ Plus a list of the frequent items
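Putting the two passes together, here is a minimal sketch of A-Priori for pairs (the function is our own and is kept deliberately simple; a real implementation would store the pair counts in the triangular matrix or triples table discussed above):

```python
from collections import defaultdict
from itertools import combinations

def apriori_pairs(baskets, s):
    """Two-pass A-Priori sketch for frequent pairs."""
    # Pass 1: count individual items
    item_counts = defaultdict(int)
    for basket in baskets:
        for item in basket:
            item_counts[item] += 1
    frequent_items = {i for i, c in item_counts.items() if c >= s}

    # Pass 2: count only pairs whose two items are both frequent
    pair_counts = defaultdict(int)
    for basket in baskets:
        kept = sorted(i for i in basket if i in frequent_items)
        for pair in combinations(kept, 2):
            pair_counts[pair] += 1
    return {p: c for p, c in pair_counts.items() if c >= s}
```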
Main-Memory: Picture of A-Priori
[Figure: main-memory layout. Pass 1 holds the item counts; Pass 2 holds the frequent items plus the counts of pairs of frequent items (candidate pairs)]


Detail for A-Priori
◼ You can use the triangular matrix method with n = number of frequent items
▪ May save space compared with storing triples
◼ Trick: re-number the frequent items 1, 2, … and keep a table relating the new numbers to the original item numbers
Frequent Triples, Etc.
◼ For each k, we construct two sets of k-tuples (sets of size k):
▪ Ck = candidate k-tuples = those that might be frequent sets (support ≥ s) based on information from the pass for k–1
▪ Lk = the set of truly frequent k-tuples
[Figure: the pipeline – count all items (C1), filter to get L1; construct all pairs of items from L1 (C2), count and filter to get L2; construct C3 (to be explained); and so on]


Example
◼ Hypothetical steps of the A-Priori algorithm
▪ C1 = { {b} {c} {j} {m} {n} {p} }
▪ Count the support of itemsets in C1
▪ Prune non-frequent: L1 = { b, c, j, m }
▪ Generate C2 = { {b,c} {b,j} {b,m} {c,j} {c,m} {j,m} }
▪ Count the support of itemsets in C2
▪ Prune non-frequent: L2 = { {b,m} {b,c} {c,m} {c,j} }
▪ Generate C3 = { {b,c,m} {b,c,j} {b,m,j} {c,m,j} } **
▪ Count the support of itemsets in C3
▪ Prune non-frequent: L3 = { {b,c,m} }
** Note: here we generate the new candidates Ck from Lk–1 and L1, but one can be more careful with candidate generation. For example, in C3 we know {b,m,j} cannot be frequent since {m,j} is not frequent.
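A sketch of the simple candidate construction used above (the function name is ours; as the note says, a more careful generator would also prune candidates that have a non-frequent (k–1)-subset):

```python
def construct_candidates(L_prev, L1, k):
    """C_k from L_{k-1} and L_1, as in the example above."""
    return {a | b for a in L_prev for b in L1 if len(a | b) == k}

L1 = {frozenset([x]) for x in "bcjm"}
L2 = {frozenset(x) for x in [("b", "m"), ("b", "c"), ("c", "m"), ("c", "j")]}
C3 = construct_candidates(L2, L1, 3)
# {b,c,m} {b,c,j} {b,m,j} {c,m,j} -- matching the slide's C3
```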
A-Priori for All Frequent Itemsets
◼ One pass for each k (itemset size)
◼ Needs room in main memory to count each candidate k-tuple
◼ For typical market-basket data and reasonable support (e.g., 1%), k = 2 requires the most memory
◼ Many possible extensions:
▪ Association rules with intervals:
▪ For example: Men over 65 have 2 cars
▪ Association rules when items are in a taxonomy
▪ Bread, Butter → FruitJam
▪ BakedGoods, MilkProduct → PreservedGoods
▪ Lower the support s as the itemset gets bigger
PCY (Park-Chen-Yu) Algorithm
◼ Observation: In Pass 1 of A-Priori, most memory is idle
▪ We store only individual item counts
▪ Can we use the idle memory to reduce the memory required in Pass 2?
◼ Pass 1 of PCY: In addition to item counts, maintain a hash table with as many buckets as fit in memory
▪ Keep a count for each bucket into which pairs of items are hashed
▪ For each bucket, just keep the count, not the actual pairs that hash to the bucket!
PCY Algorithm – First Pass
FOR (each basket):
    FOR (each item in the basket):
        add 1 to item’s count;
    FOR (each pair of items):            // New in PCY
        hash the pair to a bucket;
        add 1 to the count for that bucket;
◼ A few things to note:
▪ Pairs of items need to be generated from the input file; they are not present in the file
▪ We are not just interested in the presence of a pair, but we need to see whether it is present at least s (support) times
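A sketch of PCY’s first pass in Python (our own function; any hash function on pairs works here, so Python’s built-in `hash` stands in for the slide’s bucket hash):

```python
from collections import defaultdict
from itertools import combinations

def pcy_pass1(baskets, num_buckets):
    """PCY pass 1 sketch: item counts plus one count per hash bucket,
    where every pair in every basket is hashed to some bucket."""
    item_counts = defaultdict(int)
    bucket_counts = [0] * num_buckets
    for basket in baskets:
        for item in basket:
            item_counts[item] += 1
        for pair in combinations(sorted(basket), 2):
            bucket_counts[hash(pair) % num_buckets] += 1
    return item_counts, bucket_counts
```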
Observations about Buckets
◼ Observation: If a bucket contains a frequent pair, then the bucket is surely frequent
◼ However, even without any frequent pair, a bucket can still be frequent ☹
▪ So, we cannot use the hash to eliminate any member (pair) of a “frequent” bucket
◼ But, for a bucket with total count less than s, none of its pairs can be frequent ☺
▪ Pairs that hash to this bucket can be eliminated as candidates (even if the pair consists of 2 frequent items)
◼ Pass 2: Only count pairs that hash to frequent buckets
PCY Algorithm – Between Passes
◼ Replace the buckets by a bit-vector:
▪ 1 means the bucket count exceeded the support s (call it a frequent bucket); 0 means it did not
◼ 4-byte integer counts are replaced by bits, so the bit-vector requires 1/32 of the memory
◼ Also, decide which items are frequent and list them for the second pass
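A sketch of this between-passes summarization (our own helpers; packing the bits into a single Python integer illustrates the 1/32-of-memory claim):

```python
def to_bitmap(bucket_counts, s):
    """One bit per bucket (1 = frequent bucket), packed into a single
    integer so the 4-byte counts shrink to 1/32 of the space."""
    bits = 0
    for idx, count in enumerate(bucket_counts):
        if count >= s:
            bits |= 1 << idx
    return bits

def is_frequent_bucket(bits, idx):
    """True iff bucket idx was marked frequent in the bit-vector."""
    return (bits >> idx) & 1 == 1
```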
PCY Algorithm – Pass 2
◼ Count all pairs {i, j} that meet the conditions for being a candidate pair:
1. Both i and j are frequent items
2. The pair {i, j} hashes to a bucket whose bit in the bit-vector is 1 (i.e., a frequent bucket)
◼ Both conditions are necessary for the pair to have a chance of being frequent
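And a sketch of Pass 2, reusing `is_frequent_bucket` from the previous sketch (the function name and signature are our own):

```python
from collections import defaultdict
from itertools import combinations

def pcy_pass2(baskets, frequent_items, bits, num_buckets, s):
    """PCY pass 2 sketch: count {i, j} only if both items are frequent
    and the pair hashes to a frequent bucket."""
    pair_counts = defaultdict(int)
    for basket in baskets:
        kept = sorted(i for i in basket if i in frequent_items)
        for pair in combinations(kept, 2):
            if is_frequent_bucket(bits, hash(pair) % num_buckets):
                pair_counts[pair] += 1
    return {p: c for p, c in pair_counts.items() if c >= s}
```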
Main-Memory: Picture of PCY
[Figure: main-memory layout. Pass 1 holds the item counts plus the hash table for pairs; Pass 2 holds the frequent items, the bitmap (which replaces the hash table), and the counts of candidate pairs]
Frequent Itemsets in < 2 Passes
◼ A-Priori, PCY, etc., take k passes to find frequent itemsets of size k
◼ Can we use fewer passes?
◼ Use 2 or fewer passes for all sizes, but we may miss some frequent itemsets
▪ Random sampling (often a cure for having too much data)
▪ SON (Savasere, Omiecinski, and Navathe)


Random Sampling (1)
◼ Take a random sample of the market baskets
◼ Run A-Priori or one of its improvements in main memory
▪ So we don’t pay for disk I/O each time we increase the size of itemsets
▪ Reduce the support threshold proportionally to match the sample size
[Figure: main memory holds a copy of the sample baskets plus space for counts]


Random Sampling (2)
◼ Optionally, verify that the candidate pairs are truly frequent in the entire data set with a second pass (avoids false positives)
◼ But you don’t catch sets that are frequent in the whole data but not in the sample
▪ A smaller threshold, e.g., s/1.25, helps catch more truly frequent itemsets
▪ But it requires more space
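A sketch of the sampling idea, reusing the `apriori_pairs` helper from the A-Priori sketch (the `fraction` and `safety` parameters are our own illustration of the proportional threshold reduction and the slide’s s/1.25 trick):

```python
import random

def sampled_frequent_pairs(baskets, s, fraction=0.1, safety=1.25):
    """Sampling sketch: keep each basket with probability `fraction`,
    then run in-memory A-Priori on the sample with the threshold scaled
    down to s * fraction and lowered a bit further by `safety`."""
    sample = [b for b in baskets if random.random() < fraction]
    sample_s = max(1, int(s * fraction / safety))
    return apriori_pairs(sample, sample_s)
```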
SON Algorithm – (1)
◼ Repeatedly read small subsets of the baskets into main memory and run an in-memory algorithm to find all frequent itemsets
▪ Note: we are not sampling, but processing the entire file in memory-sized chunks
◼ An itemset becomes a candidate if it is found to be frequent in any one or more subsets of the baskets
SON Algorithm – (2)
◼ On a second pass, count all the candidate itemsets and determine which are frequent in the entire set
◼ Key “monotonicity” idea: an itemset cannot be frequent in the entire set of baskets unless it is frequent in at least one subset
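A sketch of SON restricted to pairs, again reusing `apriori_pairs` (chunking by striding through the basket list is our own choice; any partition of the baskets into memory-sized chunks works):

```python
def son_pairs(baskets, s, num_chunks):
    """SON sketch: mine each memory-sized chunk with a proportionally
    lowered threshold, union the per-chunk results as candidates, then
    verify exact counts on a second pass over the full data."""
    candidates = set()
    for chunk_id in range(num_chunks):
        chunk = baskets[chunk_id::num_chunks]
        chunk_s = max(1, s * len(chunk) // len(baskets))
        candidates |= set(apriori_pairs(chunk, chunk_s))

    # Second pass: exact counts over the entire data set
    counts = {p: 0 for p in candidates}
    for basket in baskets:
        for p in candidates:
            if set(p) <= basket:
                counts[p] += 1
    return {p for p, c in counts.items() if c >= s}
```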


SON – Distributed Version
◼ SON lends itself to distributed data mining
◼ Baskets are distributed among many nodes
▪ Compute frequent itemsets at each node
▪ Distribute candidates to all nodes
▪ Accumulate the counts of all candidates
