Dm4part1 PDF
Association Analysis
Part 1
Dr. Sanjay Ranka
Professor
Computer and Information Science and Engineering
University of Florida
University of Florida CISE department Gator Engineering
Mining Associations
• Given a set of records, find rules that will predict
the occurrence of an item based on the
occurrences of other items in the record
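Market-Basket transactions
For a rule X → Y over N transactions, where σ(·) counts the transactions that contain an itemset:
Support: fraction of transactions that contain both X and Y, s(X → Y) = σ(X ∪ Y) / N
Confidence: fraction of transactions containing X that also contain Y, c(X → Y) = σ(X ∪ Y) / σ(X)
Goal: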
Discover all rules having
support ≥ minsup and
confidence ≥ minconf
thresholds.
Observations:
• All the rules above correspond to the
same itemset: {Milk, Diaper, Beer}
• Rules obtained from the same
itemset have identical support but
can have different confidence
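The observations above can be checked with a short sketch. The five transactions below are an assumption made for illustration (the slide's own table is not reproduced here), and the helper names `support_count`, `support` and `confidence` are hypothetical.

```python
# Hypothetical market-basket transactions, assumed for illustration only.
TRANSACTIONS = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def support_count(itemset):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in TRANSACTIONS if set(itemset) <= t)


def support(lhs, rhs):
    """s(X -> Y) = sigma(X u Y) / N."""
    return support_count(set(lhs) | set(rhs)) / len(TRANSACTIONS)


def confidence(lhs, rhs):
    """c(X -> Y) = sigma(X u Y) / sigma(X)."""
    return support_count(set(lhs) | set(rhs)) / support_count(lhs)


# All six rules that can be formed from the itemset {Milk, Diaper, Beer}.
RULES = [
    ({"Milk", "Diaper"}, {"Beer"}),
    ({"Milk", "Beer"}, {"Diaper"}),
    ({"Diaper", "Beer"}, {"Milk"}),
    ({"Beer"}, {"Milk", "Diaper"}),
    ({"Diaper"}, {"Milk", "Beer"}),
    ({"Milk"}, {"Diaper", "Beer"}),
]

for lhs, rhs in RULES:
    print(sorted(lhs), "->", sorted(rhs),
          f"s={support(lhs, rhs):.2f}", f"c={confidence(lhs, rhs):.2f}")
```

On this assumed data every rule has the same support, because each one is built from the same itemset, while the confidence depends on which items sit on the left-hand side.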
Data Mining Sanjay Ranka Spring 2011
Itemset Lattice
Computational Complexity
• Given d unique items:
– Total number of itemsets = 2^d
– Total number of possible association rules: R = 3^d - 2^(d+1) + 1
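Both counts can be checked by brute force for small d. The sketch below enumerates every itemset and every rule X → Y with X and Y non-empty and disjoint, then compares the rule total against the standard closed form 3^d - 2^(d+1) + 1. The function names are hypothetical.

```python
from itertools import combinations


def count_itemsets(d):
    """Count all subsets of d items, including the empty set."""
    items = range(d)
    return sum(1 for k in range(d + 1) for _ in combinations(items, k))


def count_rules(d):
    """Count rules X -> Y with X, Y non-empty and disjoint."""
    items = set(range(d))
    total = 0
    for k in range(1, d):                      # size of the left-hand side
        for lhs in combinations(sorted(items), k):
            remaining = sorted(items - set(lhs))
            for j in range(1, len(remaining) + 1):   # size of the right-hand side
                total += sum(1 for _ in combinations(remaining, j))
    return total


for d in range(1, 7):
    print(d, count_itemsets(d), count_rules(d), 3**d - 2**(d + 1) + 1)
```

For d = 6 this gives 64 itemsets and 602 rules, which is why exhaustive enumeration stops being practical quickly.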
[Figure: itemset lattice. Once an itemset is found to be infrequent, all of its supersets are pruned.]
[Figure: candidate generation with minimum support = 3, from pairs (2-itemsets) to triplets (3-itemsets)]
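The level-wise search with superset pruning can be sketched as follows. This is a minimal illustration rather than the slide's own procedure: the transactions are assumed, the function name `apriori` is hypothetical, and the minimum support is given as a count of 3.

```python
from itertools import combinations

# Hypothetical transactions, assumed for illustration only.
TRANSACTIONS = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def apriori(transactions, minsup_count):
    """Return {frequent itemset: support count}, found level by level."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({i for t in transactions for i in t})
    candidates = [frozenset([i]) for i in items]
    frequent = {}
    k = 1
    while candidates:
        # Count step: one pass over the data for the current candidates.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= minsup_count}
        frequent.update(level)
        k += 1
        # Generate step: join frequent (k-1)-itemsets into k-itemsets.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: drop any candidate that has an infrequent (k-1)-subset.
        candidates = [
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        ]
    return frequent


FREQUENT = apriori(TRANSACTIONS, minsup_count=3)
for itemset, n in sorted(FREQUENT.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

On this assumed data only one triplet, {Bread, Diaper, Milk}, survives the prune step and has to be counted; the other triplets are discarded without a database pass because they contain an infrequent pair.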
[Figure: hash tree holding 15 candidate 3-itemsets. The hash function groups items into three branches: {1,4,7}, {2,5,8} and {3,6,9}. Three panels highlight the branch followed when hashing on 1, 4 or 7; on 2, 5 or 8; and on 3, 6 or 9.
Candidates stored in the leaves: 124, 125, 136, 145, 159, 234, 345, 356, 357, 367, 368, 457, 458, 567, 689]
Candidate Counting
• Given a transaction L = {1,2,3,5,6}
• Possible subsets of size 3:
{1,2,3} {1,2,5} {1,2,6} {1,3,5} {1,3,6}
{1,5,6} {2,3,5} {2,3,6} {2,5,6} {3,5,6}
• If the width of a transaction is w, there are 2^w - 1
possible non-empty subsets
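The subset counts above can be reproduced with the standard library. The sketch enumerates the size-3 subsets of the transaction {1,2,3,5,6} and counts all of its non-empty subsets; the variable names are illustrative.

```python
from itertools import combinations

transaction = (1, 2, 3, 5, 6)
w = len(transaction)

# Every subset of size 3, in lexicographic order.
subsets_of_size_3 = list(combinations(transaction, 3))

# All non-empty subsets, of sizes 1 through w.
non_empty_subsets = [
    s for k in range(1, w + 1) for s in combinations(transaction, k)
]

print(subsets_of_size_3)
print(len(subsets_of_size_3), len(non_empty_subsets), 2**w - 1)
```

Counting candidates by enumerating transaction subsets this way grows exponentially in w, which motivates the hash tree for matching transactions against candidates.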
Hash Function
[Figure: matching the transaction 1 2 3 5 6 against the candidate hash tree. At the root the transaction is split into 1+ 2356, 2+ 356 and 3+ 56. At the next level, 1+ 2356 is split into 12+ 356, 13+ 56 and 15+ 6.]
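A candidate hash tree of this kind can be sketched in a few lines. The code below is an illustration under stated assumptions, not a reconstruction of the figure: the leaf capacity of 3, the names `HashTreeNode`, `insert`, `matches` and `bucket`, and the resulting tree shape are all assumed. The hash function follows the grouping {1,4,7}, {2,5,8}, {3,6,9}.

```python
from itertools import combinations

CANDIDATES = [
    (1, 2, 4), (1, 2, 5), (1, 3, 6), (1, 4, 5), (1, 5, 9),
    (2, 3, 4), (3, 4, 5), (3, 5, 6), (3, 5, 7), (3, 6, 7),
    (3, 6, 8), (4, 5, 7), (4, 5, 8), (5, 6, 7), (6, 8, 9),
]


def bucket(item):
    """Items 1,4,7 -> 0; 2,5,8 -> 1; 3,6,9 -> 2."""
    return (item - 1) % 3


class HashTreeNode:
    def __init__(self):
        self.children = None   # dict bucket -> node once the node is internal
        self.itemsets = []     # candidates stored while the node is a leaf


def insert(node, itemset, depth=0, max_leaf=3):
    """Insert a sorted candidate; split a leaf that overflows (assumed capacity)."""
    if node.children is not None:
        child = node.children.setdefault(bucket(itemset[depth]), HashTreeNode())
        insert(child, itemset, depth + 1, max_leaf)
        return
    node.itemsets.append(itemset)
    if len(node.itemsets) > max_leaf and depth < len(itemset):
        pending, node.itemsets, node.children = node.itemsets, [], {}
        for s in pending:
            insert(node, s, depth, max_leaf)


def matches(node, transaction, k, start=0, depth=0, found=None):
    """Collect candidates of size k contained in the sorted transaction."""
    if found is None:
        found = set()
    if node.children is None:
        items = set(transaction)
        found.update(s for s in node.itemsets if items.issuperset(s))
        return found
    # Hash on each item that can still be the next item of a k-subset.
    for i in range(start, len(transaction) - (k - depth) + 1):
        child = node.children.get(bucket(transaction[i]))
        if child is not None:
            matches(child, transaction, k, i + 1, depth + 1, found)
    return found


root = HashTreeNode()
for c in CANDIDATES:
    insert(root, c)

found = matches(root, (1, 2, 3, 5, 6), k=3)
brute_force = set(combinations((1, 2, 3, 5, 6), 3)) & set(CANDIDATES)
print(sorted(found), found == brute_force)
```

Only the leaves reached by hashing the transaction's items are inspected, so the transaction is compared against a fraction of the candidates instead of all 15. Results are collected in a set because the same leaf can be reached along more than one path.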
Rule Generation
• Given a frequent itemset L, find all non-empty
subsets f ⊂ L such that f → L – f satisfies the
minimum confidence requirement
– If {A,B,C,D} is a frequent itemset, candidate rules:
ABC → D, ABD → C, ACD → B, BCD → A,
A → BCD, B → ACD, C → ABD, D → ABC,
AB → CD, AC → BD, AD → BC, BC → AD,
BD → AC, CD → AB
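The candidate rules for a frequent itemset can be enumerated mechanically: every non-empty proper subset f becomes a left-hand side, and L - f becomes the right-hand side. A minimal sketch, with the hypothetical helper name `candidate_rules`:

```python
from itertools import combinations


def candidate_rules(itemset):
    """Return all (lhs, rhs) pairs with lhs a non-empty proper subset."""
    items = sorted(itemset)
    rules = []
    for k in range(1, len(items)):
        for lhs in combinations(items, k):
            rhs = tuple(i for i in items if i not in lhs)
            rules.append((lhs, rhs))
    return rules


rules = candidate_rules("ABCD")
for lhs, rhs in rules:
    print("".join(lhs), "->", "".join(rhs))
print(len(rules))
```

For a 4-item itemset this yields 2^4 - 2 = 14 candidate rules, matching the list above.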
Rule Generation
• How to efficiently generate rules from frequent
itemsets?
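– In general, confidence does not have an anti-monotone property: c(ABC → D) can be larger or smaller than c(AB → D)
– But confidence of rules generated from the same itemset has an anti-monotone property with respect to the number of items on the right-hand side of the rule
– L = {A,B,C,D}: c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)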
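This property allows rules to be generated level by level on the size of the consequent: if f → L - f fails the confidence threshold, no rule from L whose consequent contains L - f can pass it, so larger consequents are built only from surviving ones. The sketch below assumes a small set of transactions and uses the hypothetical names `count` and `generate_rules`.

```python
from itertools import combinations

# Hypothetical transactions, assumed for illustration only.
TRANSACTIONS = [
    frozenset({"Bread", "Milk"}),
    frozenset({"Bread", "Diaper", "Beer", "Eggs"}),
    frozenset({"Milk", "Diaper", "Beer", "Coke"}),
    frozenset({"Bread", "Milk", "Diaper", "Beer"}),
    frozenset({"Bread", "Milk", "Diaper", "Coke"}),
]


def count(itemset):
    """Support count of an itemset over the assumed transactions."""
    return sum(1 for t in TRANSACTIONS if itemset <= t)


def generate_rules(itemset, minconf):
    """Return (lhs, rhs, confidence) for rules from `itemset` meeting minconf."""
    itemset = frozenset(itemset)
    sup = count(itemset)
    rules = []
    consequents = [frozenset([i]) for i in itemset]   # start with 1-item consequents
    while consequents:
        kept = []
        for rhs in consequents:
            lhs = itemset - rhs
            if not lhs:
                continue                               # skip the rule {} -> L
            conf = sup / count(lhs)
            if conf >= minconf:
                rules.append((lhs, rhs, conf))
                kept.append(rhs)
        size = len(consequents[0]) + 1
        # Merge surviving consequents; a larger consequent is tried only if
        # every consequent one item smaller survived.
        merged = {a | b for a in kept for b in kept if len(a | b) == size}
        consequents = [
            m for m in merged
            if all(frozenset(s) in kept for s in combinations(m, size - 1))
        ]
    return rules


for lhs, rhs, conf in generate_rules({"Milk", "Diaper", "Beer"}, minconf=0.6):
    print(sorted(lhs), "->", sorted(rhs), f"c={conf:.2f}")
```

Raising `minconf` to 0.7 on this assumed data leaves a single surviving 1-item consequent, so no 2-item consequents are generated at all.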
• Representation of Database
– Apriori uses horizontal data layout
• Generate-and-count paradigm