MS (Data Science) Fall 2020 Semester
Course Teacher
Books
• “Introduction to Data Mining” by Tan, Steinbach, Kumar.
Market-Basket Data
• A large set of items, e.g., things sold in a
supermarket.
• A large set of baskets, each of which is a small
subset of the items, e.g., the things one customer
buys on one day.
Items: {Bread, Milk, Diaper, Beer, Eggs, Coke}
Baskets: Transactions
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
Frequent itemsets
• Goal: find combinations of items (itemsets) that
occur frequently
• Called Frequent Itemsets
• Support: number of transactions that contain itemset I
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
Examples of frequent itemsets (support ≥ 3):
{Bread}: 4
{Milk}: 4
{Diaper}: 4
{Beer}: 3
{Diaper, Beer}: 3
{Milk, Bread}: 3
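The support definition can be checked directly on the five example transactions. The following is a minimal sketch; the function name `support` and the representation of baskets as Python sets are assumptions made for illustration.

```python
# The five example transactions from the slide, one set per basket.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def support(itemset, baskets):
    """Count the baskets that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(1 for basket in baskets if itemset <= basket)


print(support({"Diaper", "Beer"}, transactions))  # 3
print(support({"Milk", "Bread"}, transactions))   # 3
print(support({"Eggs"}, transactions))            # 1, below the threshold of 3
```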
Market-Baskets – (2)
• Really, a general many-to-many mapping
(association) between two kinds of things, where the
one (the baskets) is a set of the other (the items)
• But we ask about connections among “items,” not “baskets.”
Applications – (1)
• Items = products; baskets = sets of products
someone bought in one trip to the store.
Applications – (2)
• Baskets = Web pages; items = words.
Applications – (3)
• Baskets = sentences; items = documents
containing those sentences.
• Problem parameters:
• N (size): number of transactions
• Walmart: billions of baskets per year
• Web: billions of pages
• d (dimension): number of (distinct) items
• Walmart sells more than 100,000 items
• Web: billions of words
• w: max size of a basket
• M: number of possible itemsets, M = 2^d
Itemset lattice for d = 5 items (A to E):
A B C D E
AB AC AD AE BC BD BE CD CE DE
ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE
ABCD ABCE ABDE ACDE BCDE
ABCDE
A Naïve Algorithm
• Brute-force approach: every itemset is a candidate.
• Consider all itemsets in the lattice, and scan the data for each candidate to
compute the support
• OR
• Scan the data, and for each transaction generate all possible itemsets.
Keep a count for each itemset in the data.
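The first brute-force variant (enumerate every itemset in the lattice, then scan the data once per candidate) can be sketched as follows. The function name `brute_force_frequent` and the threshold parameter `minsup` are assumptions made for illustration; the sketch is only practical for a handful of items, since the number of candidates grows as 2^d.

```python
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def brute_force_frequent(baskets, minsup):
    """Try every non-empty itemset; scan all baskets for each candidate."""
    items = sorted(set().union(*baskets))
    frequent = {}
    for k in range(1, len(items) + 1):
        for candidate in combinations(items, k):   # one lattice node
            wanted = set(candidate)
            count = sum(1 for basket in baskets if wanted <= basket)
            if count >= minsup:
                frequent[candidate] = count
    return frequent


result = brute_force_frequent(transactions, minsup=3)
for itemset, count in sorted(result.items(), key=lambda kv: (len(kv[0]), kv[0])):
    print(itemset, count)
```

With 6 items the loop visits 2^6 - 1 = 63 candidates and scans all 5 baskets for each one, which is why this approach does not scale.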
[Figure: the N transactions (each with at most w items) are matched against a list of M candidates.]
Computation Model
• Typically, data is kept in flat files rather than in a
database system.
• Stored on disk.
• Stored basket-by-basket.
• We can expand a basket into pairs, triples, etc. as we read
the data.
• Use k nested loops, or recursion to generate all itemsets of size k.
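Reading a flat file basket by basket and expanding each basket into its itemsets of size k can be sketched as below. This sketch uses `itertools.combinations` in place of hand-written nested loops or recursion. The comma-separated line format, the in-memory `io.StringIO` stand-in for a file on disk, and the names `read_baskets` and `count_itemsets` are all assumptions made for illustration.

```python
import io
from collections import Counter
from itertools import combinations


def read_baskets(lines):
    """Yield one basket (list of items) per non-empty line."""
    for line in lines:
        items = [item.strip() for item in line.split(",") if item.strip()]
        if items:
            yield items


def count_itemsets(lines, k):
    """One pass over the data: expand each basket into its k-itemsets."""
    counts = Counter()
    for basket in read_baskets(lines):
        # Sorting gives every itemset a single canonical tuple.
        for itemset in combinations(sorted(set(basket)), k):
            counts[itemset] += 1
    return counts


# Stand-in for a flat file stored basket-by-basket.
flat_file = io.StringIO(
    "Bread, Milk\n"
    "Bread, Diaper, Beer, Eggs\n"
    "Milk, Diaper, Beer, Coke\n"
    "Bread, Milk, Diaper, Beer\n"
    "Bread, Milk, Diaper, Coke\n"
)

pair_counts = count_itemsets(flat_file, k=2)
print(pair_counts[("Beer", "Diaper")])  # 3
print(pair_counts[("Bread", "Milk")])   # 3
```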
Main-Memory Bottleneck
• For many frequent-itemset algorithms, main
memory is the critical resource.
• As we read baskets, we need to count something, e.g.,
occurrences of pairs.
• The number of different things we can count is limited
by main memory.
• Swapping counts in/out is too slow
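A rough calculation shows why memory is the bottleneck when counting pairs. The sketch below takes d = 100,000 distinct items (the Walmart figure from the problem parameters) and assumes one 4-byte integer counter per pair; the counter size is an assumption, not something stated in the slides.

```python
from math import comb

d = 100_000          # distinct items (Walmart figure from the slides)
bytes_per_count = 4  # assumption: one 32-bit integer per pair counter

pairs = comb(d, 2)   # number of distinct pairs of items
gigabytes = pairs * bytes_per_count / 1e9

print(pairs)                 # 4999950000
print(round(gigabytes, 1))   # 20.0
```

Under these assumptions, a counter for every possible pair already needs about 20 GB, before any triples are considered.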
[Figure: the itemset lattice over items A to E, shown twice. In the first view an itemset is found to be frequent. In the second view an itemset is found to be infrequent, so all of its supersets, up to ABCDE, are infrequent and are pruned.]
1. k = 1, C1 = all items
2. While Ck not empty
3.    Frequent itemset generation: scan the database to find which itemsets in Ck are frequent and put them into Lk
4.    Candidate generation: generate the candidate itemsets Ck+1 of size k+1 using Lk
5.    k = k+1
R. Agrawal, R. Srikant: "Fast Algorithms for Mining Association Rules",
Proc. of the 20th Int'l Conference on Very Large Databases, 1994.
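The level-wise loop can be sketched in a few lines of Python. This is an illustrative sketch, not the implementation from the cited paper: the function name `apriori`, the `minsup` parameter, and the simple union-based way of building Ck+1 are assumptions chosen for brevity.

```python
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def apriori(baskets, minsup):
    baskets = [frozenset(basket) for basket in baskets]
    # Step 1: k = 1, C1 = all items.
    candidates = {frozenset([item]) for basket in baskets for item in basket}
    frequent = {}
    k = 1
    while candidates:  # Step 2
        # Step 3: one scan of the data counts every candidate in Ck.
        counts = dict.fromkeys(candidates, 0)
        for basket in baskets:
            for cand in candidates:
                if cand <= basket:
                    counts[cand] += 1
        level = {cand for cand, n in counts.items() if n >= minsup}  # Lk
        frequent.update((cand, counts[cand]) for cand in level)
        # Step 4: Ck+1 = unions of two members of Lk that have size k+1
        # and whose subsets of size k are all in Lk.
        candidates = {
            x | y
            for x, y in combinations(level, 2)
            if len(x | y) == k + 1
            and all(frozenset(s) in level for s in combinations(x | y, k))
        }
        k += 1  # Step 5
    return frequent


result = apriori(transactions, minsup=3)
for itemset, count in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

On the five example transactions with a threshold of 3, the loop stops after k = 3: the only candidate of size 3 is {Bread, Diaper, Milk}, which has support 2.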
Candidate Generation
• Apriori principle:
• An itemset of size k+1 is candidate to be frequent only if
all of its subsets of size k are known to be frequent
Candidate generation:
• Construct a candidate of size k+1 by combining
frequent itemsets of size k
• If k = 1, take all pairs of frequent items
• If k > 1, join pairs of itemsets that differ by just one item
• For each generated candidate itemset ensure that all
subsets of size k are frequent.
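The join and prune steps can be sketched as follows. The sketch assumes the common prefix-based join (two sorted itemsets are joined when they share their first k-1 items), which is a restricted form of "differ by just one item" that still produces every candidate able to survive pruning. The function name `generate_candidates` is hypothetical.

```python
from itertools import combinations


def generate_candidates(frequent_k):
    """Build candidates of size k+1 from the frequent itemsets of size k."""
    frequent_k = sorted(tuple(sorted(itemset)) for itemset in frequent_k)
    known = set(frequent_k)
    k = len(frequent_k[0]) if frequent_k else 0
    candidates = []
    for a, b in combinations(frequent_k, 2):
        # Join: same first k-1 items, different last item.
        # For k = 1 the shared prefix is empty, so all pairs are joined.
        if a[:-1] == b[:-1]:
            candidate = a + (b[-1],)
            # Prune: every subset of size k must itself be frequent.
            if all(sub in known for sub in combinations(candidate, k)):
                candidates.append(candidate)
    return candidates


L3 = ["abc", "abd", "acd", "ace", "bcd"]
print(generate_candidates(L3))  # [('a', 'b', 'c', 'd')]
```

In this run, joining acd and ace produces acde, but it is pruned because its subset ade is not in L3.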
Example
• L3={abc, abd, acd, ace, bcd}
• Self-join: L3*L3
item1 item2 item3 item1 item2 item3
a b c a b c
a b d a b d
a c d a c d
a c e a c e
b c d b c d
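Example
• L3={abc, abd, acd, ace, bcd}
• Joining abc and abd gives {a,b,c,d}; all of its subsets of size 3 (abc, abd, acd, bcd) are in L3, so it is kept
• Joining acd and ace gives {a,c,d,e}; its subset ade is not in L3, so it is pruned
• C4={abcd}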
Number of Combinations
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke