
University of Florida CISE department Gator Engineering

Association Analysis
Part 1
Dr. Sanjay Ranka
Professor
Computer and Information Science and Engineering
University of Florida

Mining Associations
• Given a set of records, find rules that will predict
the occurrence of an item based on the
occurrences of other items in the record
Example: Market-Basket transactions

TID   Items
1     Bread, Milk
2     Bread, Diaper, Beer, Eggs
3     Milk, Diaper, Beer, Coke
4     Bread, Milk, Diaper, Beer
5     Bread, Milk, Diaper, Coke

Example rule: {Milk, Diaper} → {Beer}

Definition of Association Rule

• Association Rule: an implication expression of the form X → Y, where X and Y are disjoint itemsets

• Support (s): fraction of transactions that contain both X and Y
  s(X → Y) = σ(X ∪ Y) / N

• Confidence (c): how often items in Y appear in transactions that contain X
  c(X → Y) = σ(X ∪ Y) / σ(X)

Example: {Milk, Diaper} → {Beer}
  s = σ({Milk, Diaper, Beer}) / |T| = 2/5 = 0.4
  c = σ({Milk, Diaper, Beer}) / σ({Milk, Diaper}) = 2/3 = 0.67

Goal: discover all rules having support ≥ minsup and confidence ≥ minconf, where minsup and minconf are the corresponding threshold values
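A minimal sketch of these two measures in Python (the transaction list is the market-basket table above; the function names are illustrative, not from the lecture):

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset, transactions):
    # Fraction of transactions that contain every item of `itemset`.
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(X, Y, transactions):
    # c(X -> Y) = sigma(X ∪ Y) / sigma(X)
    return support(X | Y, transactions) / support(X, transactions)

print(support({"Milk", "Diaper", "Beer"}, transactions))       # 0.4
print(confidence({"Milk", "Diaper"}, {"Beer"}, transactions))  # 0.666...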


How to Mine Association Rules?


Example of Rules:
{Milk,Diaper} → {Beer} (s=0.4, c=0.67)
{Milk,Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper,Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk,Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk,Beer} (s=0.4, c=0.5)
{Milk} → {Diaper,Beer} (s=0.4, c=0.5)

Observations:
• All the rules above correspond to the
same itemset: {Milk, Diaper, Beer}
• Rules obtained from the same
itemset have identical support but
can have different confidence

How to Mine Association Rules?


• Two-step approach:
  1. Frequent itemset generation: generate all itemsets whose support ≥ minsup
  2. Rule generation: generate high-confidence association rules from each frequent itemset
     – Each rule is a binary partitioning of a frequent itemset
• Frequent itemset generation is the more expensive of the two operations


Itemset Lattice

• Given d items, there are 2^d possible itemsets, which form a lattice from the empty set up to the full item set

Generating Frequent Itemsets


• Naive approach:
  – Each itemset in the lattice is a candidate frequent itemset
  – Count the support of each candidate by scanning the database
  – Complexity ~ O(NM) => expensive, since M = 2^d !!!
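A brute-force sketch of this naive scan, assuming the market-basket table above (minsup = 3 matches the illustration later in these slides; the enumeration order is mine):

from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
items = sorted(set().union(*transactions))
minsup = 3  # absolute support count

# Every itemset in the lattice is a candidate; each candidate costs a
# full database scan, hence ~O(N * M) work with M = 2^d candidates.
frequent = {}
for k in range(1, len(items) + 1):
    for cand in combinations(items, k):
        count = sum(1 for t in transactions if set(cand) <= t)
        if count >= minsup:
            frequent[cand] = count

print(frequent)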



Computational Complexity

• Given d unique items:
  – Total number of itemsets = 2^d
  – Total number of possible association rules:
    R = Σ(k=1..d−1) [ C(d,k) × Σ(j=1..d−k) C(d−k,j) ] = 3^d − 2^(d+1) + 1
  – If d = 6, R = 602 rules
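A quick brute-force check of this count (`count_rules` is an illustrative helper that counts every antecedent with every non-empty disjoint consequent):

from math import comb

def count_rules(d):
    # Sum over antecedent sizes k: C(d,k) ways to pick X, and 2^(d-k) - 1
    # non-empty consequents Y from the remaining items.
    return sum(comb(d, k) * (2 ** (d - k) - 1) for k in range(1, d + 1))

print(count_rules(6))           # 602
print(3 ** 6 - 2 ** 7 + 1)      # 602, via the closed form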


Approach for Mining Frequent Itemsets

• Reduce the number of candidates (M)
  – Complete search: M = 2^d
  – Use the Apriori heuristic to reduce M
• Reduce the number of transactions (N)
  – Reduce N as the size of the itemsets increases
  – Used by DHP and vertical-based mining algorithms
• Reduce the number of comparisons (NM)
  – Use efficient data structures to store the candidates or transactions
  – No need to match every candidate against every transaction

Reducing Number of Candidates


• Apriori principle:
  – If an itemset is frequent, then all of its subsets must also be frequent
• The Apriori principle holds due to the following property of the support measure:
  ∀ X, Y : (X ⊆ Y) ⇒ s(X) ≥ s(Y)
  – The support of an itemset never exceeds the support of any of its subsets
  – This is known as the anti-monotone property of support
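A compact sketch of level-wise candidate generation with this pruning (the merge of (k−1)-itemsets sharing a prefix and the subset check follow the Apriori principle; data and minsup are the running example):

from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
minsup = 3

def support_count(itemset):
    return sum(1 for t in transactions if set(itemset) <= t)

def apriori(items):
    # Level 1: frequent 1-itemsets (sorted tuples throughout).
    Fk = [(i,) for i in sorted(items) if support_count((i,)) >= minsup]
    frequent = list(Fk)
    while Fk:
        # Merge frequent (k-1)-itemsets that share their first k-2 items.
        candidates = {a + (b[-1],) for a in Fk for b in Fk
                      if a[:-1] == b[:-1] and a[-1] < b[-1]}
        # Apriori pruning: drop any candidate with an infrequent subset.
        prev = set(Fk)
        candidates = [c for c in candidates
                      if all(s in prev for s in combinations(c, len(c) - 1))]
        Fk = [c for c in candidates if support_count(c) >= minsup]
        frequent.extend(Fk)
    return frequent

print(apriori(set().union(*transactions)))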

Using Apriori Principle for Pruning Candidates

• If an itemset is infrequent, then all of its supersets must also be infrequent

[Figure: itemset lattice in which one itemset is found to be infrequent and all of its supersets are pruned]

Illustrating Apriori Principle

Minimum Support = 3

• Items (1-itemsets): count the support of each item; Coke and Eggs fall below the threshold
• Pairs (2-itemsets): no need to generate candidates involving Coke or Eggs
• Triplets (3-itemsets): only one candidate, {Bread, Milk, Diaper}, survives the pruning

If every subset is considered:
  6C1 + 6C2 + 6C3 = 6 + 15 + 20 = 41 candidates
With support-based pruning:
  6 + 6 + 1 = 13 candidates
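The two counts, checked directly (standard-library arithmetic only):

from math import comb

print(comb(6, 1) + comb(6, 2) + comb(6, 3))  # 41 candidates without pruning
print(6 + 6 + 1)                             # 13 with support-based pruning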


Reducing Number of Comparisons


• Candidate counting:
– Scan the database of transactions to determine
the support of candidate itemsets
– To reduce number of comparisons, store the
candidates using a hash structure


Association Rule Discovery:
Hash Tree for Fast Access

• Hash function: items 1, 4 or 7 hash to the left branch; 2, 5 or 8 to the middle branch; 3, 6 or 9 to the right branch
• Candidate hash tree storing 15 candidate 3-itemsets at its leaves:
  {1,4,5} {1,2,4} {4,5,7} {1,2,5} {4,5,8} {1,5,9} {1,3,6} {2,3,4}
  {5,6,7} {3,4,5} {3,5,6} {3,5,7} {6,8,9} {3,6,7} {3,6,8}
• To reach a leaf, hash on the first item of a candidate at the root, on its second item at the next level, and so on (the lecture's three slides step through hashing on 1, 4 or 7; on 2, 5 or 8; and on 3, 6 or 9)
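A minimal sketch of such a candidate hash tree, assuming the 15 candidates above, the hash function h(item) = (item − 1) mod 3, and a maximum leaf size of 3 (the leaf-size limit is an assumption; the lecture does not state one):

from pprint import pprint

candidates = [
    (1,4,5), (1,2,4), (4,5,7), (1,2,5), (4,5,8), (1,5,9), (1,3,6), (2,3,4),
    (5,6,7), (3,4,5), (3,5,6), (3,5,7), (6,8,9), (3,6,7), (3,6,8),
]

MAX_LEAF = 3  # assumed: split a leaf once it holds more than 3 candidates

def h(item):
    return (item - 1) % 3  # 1,4,7 -> 0 (left); 2,5,8 -> 1; 3,6,9 -> 2

def insert(node, itemset, depth=0):
    # Interior nodes are dicts keyed by hash bucket; leaves are lists.
    if isinstance(node, dict):
        branch = h(itemset[depth])
        node[branch] = insert(node.get(branch, []), itemset, depth + 1)
        return node
    node.append(itemset)
    if len(node) > MAX_LEAF and depth < len(itemset):
        split = {}                      # overflow: turn the leaf into an
        for c in node:                  # interior node that hashes on the
            insert(split, c, depth)     # next item of each candidate
        return split
    return node

tree = {}
for c in candidates:
    insert(tree, c)

pprint(tree)  # dicts are interior nodes, lists are leaves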


Candidate Counting

• Given a transaction t = {1,2,3,5,6}
• Possible subsets of size 3:
  {1,2,3} {1,2,5} {1,2,6} {1,3,5} {1,3,6}
  {1,5,6} {2,3,5} {2,3,6} {2,5,6} {3,5,6}
• If the width of the transaction is w, there are 2^w − 1 possible non-empty subsets
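The same enumeration in code (standard library only):

from itertools import combinations

t = (1, 2, 3, 5, 6)
print(list(combinations(t, 3)))  # the 10 subsets of size 3 listed above
print(2 ** len(t) - 1)           # 31 non-empty subsets of a width-5 transaction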

Association Rule Discovery: Subset Operation

• Given transaction t = {1, 2, 3, 5, 6}, match its 3-subsets against the candidate hash tree without generating them all explicitly
• At the root, fix the first item of a subset: 1 + {2,3,5,6}, 2 + {3,5,6}, or 3 + {5,6} (items 5 and 6 leave too few following items to start a 3-subset), and hash that item to pick a branch: 1, 4 or 7 → left; 2, 5 or 8 → middle; 3, 6 or 9 → right



Association Rule Discovery: Subset Operation …

• The expansion continues one level down: after fixing 1, fix the second item, giving 1 2 + {3,5,6}, 1 3 + {5,6} and 1 5 + {6}, and hash that second item to descend further (similarly for the prefixes starting with 2 and 3)
• When a leaf is reached, each candidate stored there is matched against the transaction and its support count is incremented on a hit
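A hedged sketch of this traversal (same 15 candidates and hash function as before; storing candidates exactly three levels deep without leaf splitting keeps the sketch short, so the tree shape differs slightly from the figure):

candidates = [
    (1,4,5), (1,2,4), (4,5,7), (1,2,5), (4,5,8), (1,5,9), (1,3,6), (2,3,4),
    (5,6,7), (3,4,5), (3,5,6), (3,5,7), (6,8,9), (3,6,7), (3,6,8),
]

def h(item):
    return (item - 1) % 3

# Fixed-depth tree: tree[h(c0)][h(c1)][h(c2)] is a leaf list of candidates.
tree = {}
for c in candidates:
    node = tree.setdefault(h(c[0]), {}).setdefault(h(c[1]), {})
    node.setdefault(h(c[2]), []).append(c)

def matched(transaction):
    t = set(transaction)
    hits = set()
    def walk(node, items, depth):
        if depth == 3:                        # leaf: verify each candidate
            hits.update(c for c in node if set(c) <= t)
            return
        for i, item in enumerate(items):      # "1+ 2356", "12+ 356", ...
            child = node.get(h(item))
            if child is not None:             # prune missing branches
                walk(child, items[i + 1:], depth + 1)
    walk(tree, tuple(sorted(transaction)), 0)
    return sorted(hits)

print(matched({1, 2, 3, 5, 6}))  # [(1, 2, 5), (1, 3, 6), (3, 5, 6)]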



Rule Generation

• Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f → L – f satisfies the minimum confidence requirement
  – If {A,B,C,D} is a frequent itemset, the candidate rules are:
    ABC → D, ABD → C, ACD → B, BCD → A,
    A → BCD, B → ACD, C → ABD, D → ABC,
    AB → CD, AC → BD, AD → BC, BC → AD,
    BD → AC, CD → AB
• If |L| = k, then there are 2^k – 2 candidate association rules (ignoring L → ∅ and ∅ → L)
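A short sketch of this enumeration (supports come from the running market-basket example; `rules_from` is an illustrative name):

from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset):
    return sum(1 for t in transactions if set(itemset) <= t) / len(transactions)

def rules_from(L, minconf):
    # Every non-empty proper subset f of L yields a candidate rule
    # f -> L - f with confidence s(L) / s(f): 2^k - 2 candidates in all.
    L = tuple(L)
    for r in range(1, len(L)):
        for f in combinations(L, r):
            head = tuple(i for i in L if i not in f)
            conf = support(L) / support(f)
            if conf >= minconf:
                yield f, head, conf

for f, head, conf in rules_from(("Milk", "Diaper", "Beer"), 0.6):
    print(f, "->", head, round(conf, 2))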


Rule Generation

• How to efficiently generate rules from frequent itemsets?
  – In general, confidence does not have an anti-monotone property: c(ABC → D) can be larger or smaller than c(AB → D)
  – But the confidence of rules generated from the same itemset does have an anti-monotone property
  – e.g., for L = {A,B,C,D}:
    c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)
• Confidence is non-increasing as the number of items in the rule consequent increases
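This chain follows from the anti-monotonicity of support: shrinking the antecedent can only raise its support, and confidence divides s(L) by it. Checked on the running example:

# c(X -> L-X) = s(L) / s(X); a smaller antecedent X has support >= that
# of a larger one, so the ratio can only shrink. Supports below are from
# the market-basket table, as fractions of 5 transactions.
s_MDB = 0.4   # s({Milk, Diaper, Beer})
s_MD  = 0.6   # s({Milk, Diaper})
s_M   = 0.8   # s({Milk})
print(s_MDB / s_MD)  # c({Milk,Diaper} -> {Beer})  ≈ 0.67
print(s_MDB / s_M)   # c({Milk} -> {Diaper,Beer})  = 0.5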


Rule Generation for Apriori Algorithm


[Figure: lattice of rules generated from the frequent itemset {A,B,C,D}]

• The lattice corresponds to a partial order on the items in the rule consequent


Rule Generation for Apriori Algorithm …

• A candidate rule is generated by merging two rules that share the same prefix in the rule consequent
  – join(CD => AB, BD => AC) would produce the candidate rule D => ABC
  – Prune rule D => ABC if its subset AD => BC does not have high confidence
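A hedged sketch of this level-wise rule generation (a simplified reading of the Apriori-style rule-generation idea, not the lecture's exact pseudocode; data are the running example and minconf = 0.6 is arbitrary):

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sup(itemset):
    return sum(1 for t in transactions if set(itemset) <= t)

def gen_rules(L, minconf):
    L = tuple(sorted(L))
    sL = sup(L)
    rules, Hk = [], []
    # Level 1: rules with single-item consequents.
    for item in L:
        body = tuple(x for x in L if x != item)
        conf = sL / sup(body)
        if conf >= minconf:
            rules.append((body, (item,), conf))
            Hk.append((item,))
    # Merge confident consequents that share a prefix (as in
    # join(CD=>AB, BD=>AC) -> D=>ABC); by anti-monotonicity,
    # low-confidence consequents are never extended.
    while Hk and len(Hk[0]) < len(L) - 1:
        Hnext = sorted({a + (b[-1],) for a in Hk for b in Hk
                        if a[:-1] == b[:-1] and a[-1] < b[-1]})
        Hk = []
        for head in Hnext:
            body = tuple(x for x in L if x not in head)
            conf = sL / sup(body)
            if conf >= minconf:
                rules.append((body, head, conf))
                Hk.append(head)
    return rules

for body, head, conf in gen_rules(("Milk", "Diaper", "Beer"), 0.6):
    print(body, "->", head, round(conf, 2))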

Other Frequent Itemset Algorithms


• Traversal of Itemset Lattice
– Apriori uses breadth-first (level-wise)
traversal

• Representation of Database
– Apriori uses horizontal data layout

• Generate-and-count paradigm
