Association Rule Mining

The document discusses association rule mining, which aims to find relationships between items in transactional data by discovering rules that predict the occurrence of an item based on the occurrences of other items. Association rule mining involves two steps: first generating all frequent itemsets whose support is above a minimum threshold, and then generating high-confidence rules from each frequent itemset. Frequent itemset generation is computationally expensive because it requires considering all possible combinations of items.

BBS654

Data Mining
Pinar Duygulu

Slides are adapted from


Nazli Ikizler

1
Why?
• Retailers now have massive databases full of transactional history
• Often simply a transaction date and the list of items purchased
• Is it possible to gain insights from this data?
• How are items in a database associated?
• Association Rules predict members of a set given other members in the set
Why?
• Example Rules:
• 98% of customers that purchase tires also get automotive services done
• Customers who buy mustard and ketchup also buy burgers
• Goal: find these rules from transactional data alone
• Rules help with: store layout, buying patterns, add-on sales, etc.
Association rule mining
• Proposed by Agrawal et al. in 1993.
• It is an important data mining model studied
extensively by the database and data mining
community.
• Assume all data are categorical.
• No good algorithm for numeric data.
• Initially used for Market Basket Analysis to find how
items purchased by customers are related.

Bread → Milk [sup = 5%, conf = 100%]

4
The model: data
• I = {i1, i2, …, im}: a set of items.
• Transaction t: a set of items such that t ⊆ I.
• Transaction Database T: a set of transactions T = {t1, t2, …, tn}.

5
Transaction data: supermarket data
• Market basket transactions:
t1: {bread, cheese, milk}
t2: {apple, eggs, salt, yogurt}
… …
tn: {biscuit, eggs, milk}
• Concepts:
• An item: an item/article in a basket
• I: the set of all items sold in the store
• A transaction: items purchased in a basket; it may have TID
(transaction ID)
• A transactional dataset: A set of transactions

6
Slide from Bing Liu
Transaction data: a set of documents
• A text document data set. Each document is treated as
a “bag” of keywords
doc1: Student, Teach, School
doc2: Student, School
doc3: Teach, School, City, Game
doc4: Baseball, Basketball
doc5: Basketball, Player, Spectator
doc6: Baseball, Coach, Game, Team
doc7: Basketball, Team, City, Game

Slide from Bing Liu


7
Association Rule Mining
• Given a set of transactions, find rules that will predict the
occurrence of an item based on the occurrences of other items
in the transaction

Market-Basket transactions:

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of Association Rules:
{Diaper} → {Beer},
{Milk, Bread} → {Eggs, Coke},
{Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!

8
Applications – (1)
• Items = products; baskets = sets of products someone bought in one
trip to the store.
• Example application: given that many people buy beer and diapers
together:
• Run a sale on diapers; raise price of beer.
• Only useful if many buy diapers & beer.

9
Applications – (2)
• Baskets = sentences; items = documents containing those sentences.
• Items that appear together too often could represent plagiarism.

10
Applications – (3)
• Baskets = Web pages; items = words.
• Unusual words appearing together in a large number of documents,
e.g., “Brad” and “Angelina,” may indicate an interesting relationship.

11
Frequent Itemset
• Itemset
• A collection of one or more items
• Example: {Milk, Bread, Diaper}
• k-itemset
• An itemset that contains k items
• Support count (σ)
• Frequency of occurrence of an itemset
• E.g. σ({Milk, Bread, Diaper}) = 2
• Support
• Fraction of transactions that contain an itemset
• E.g. s({Milk, Bread, Diaper}) = 2/5
• Frequent Itemset
• An itemset whose support is greater than or equal to a minsup threshold

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
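
To make these definitions concrete, here is a minimal Python sketch (the helper names and the hard-coded transaction list are mine, mirroring the table above) that computes the support count and support of an itemset:

# Minimal sketch: support count and support of an itemset (illustrative only).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """sigma(X): number of transactions that contain every item of X."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """s(X): fraction of transactions that contain X."""
    return support_count(itemset, transactions) / len(transactions)

itemset = {"Milk", "Bread", "Diaper"}
print(support_count(itemset, transactions))  # 2
print(support(itemset, transactions))        # 0.4  (= 2/5)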
Definition: Association Rule
• Association Rule
  – An implication expression of the form X → Y, where X and Y are itemsets
  – Example: {Milk, Diaper} → {Beer}

• Rule Evaluation Metrics
  – Support (s): fraction of transactions that contain both X and Y
  – Confidence (c): measures how often items in Y appear in transactions that contain X

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example: {Milk, Diaper} → {Beer}

s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4
c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 = 0.67
15
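
Building on the previous sketch (it reuses support_count and the same transaction list), the following lines evaluate support and confidence for {Milk, Diaper} → {Beer} and reproduce s = 0.4 and c = 0.67:

def rule_metrics(X, Y, transactions):
    """Support and confidence of the rule X -> Y."""
    s = support_count(X | Y, transactions) / len(transactions)
    c = support_count(X | Y, transactions) / support_count(X, transactions)
    return s, c

s, c = rule_metrics({"Milk", "Diaper"}, {"Beer"}, transactions)
print(round(s, 2), round(c, 2))  # 0.4 0.67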
Support and Confidence
• Support is important because
• A rule that has a low support may occur simply by chance
• A low support rule also is likely to be uninteresting from a business
perspective because it may not be profitable
• Confidence measures the reliability of the rule

16
Association Rule Mining Task
• Given a set of transactions T, the goal of association rule mining is to
find all rules having
• support ≥ minsup threshold
• confidence ≥ minconf threshold

• Brute-force approach:
• List all possible association rules
• Compute the support and confidence for each rule
• Prune rules that fail the minsup and minconf thresholds
⇒ Computationally prohibitive!

17
Mining Association Rules
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of Rules:
{Milk, Diaper} → {Beer}  (s=0.4, c=0.67)
{Milk, Beer} → {Diaper}  (s=0.4, c=1.0)
{Diaper, Beer} → {Milk}  (s=0.4, c=0.67)
{Beer} → {Milk, Diaper}  (s=0.4, c=0.67)
{Diaper} → {Milk, Beer}  (s=0.4, c=0.5)
{Milk} → {Diaper, Beer}  (s=0.4, c=0.5)

Observations:
• All the above rules are binary partitions of the same itemset:
{Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but
can have different confidence
• Thus, we may decouple the support and confidence requirements
18
Mining Association Rules
• Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup

2. Rule Generation
– Generate high confidence rules from each frequent itemset, where each rule is a binary
partitioning of a frequent itemset

• Frequent itemset generation is still computationally expensive

19
Frequent Itemset Generation
null

A B C D E

AB AC AD AE BC BD BE CD CE DE

ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE

ABCD ABCE ABDE ACDE BCDE


ABCDE

Given d items, there are 2^d possible candidate itemsets
20
Frequent Itemset Generation
• Brute-force approach:
• Each itemset in the lattice is a candidate frequent itemset
• Count the support of each candidate by scanning the database

Transactions (N of them, of average width w) are matched against the list of M candidates:

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
• Match each transaction against every candidate
• Complexity ~ O(NMw) ⇒ expensive since M = 2^d !!!

21
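
The brute-force strategy can be written down directly; the sketch below (illustrative only, with names of my own choosing) enumerates every candidate itemset and scans the database once per candidate, which is exactly the O(NMw) cost noted above:

from itertools import combinations

def brute_force_frequent_itemsets(transactions, minsup_count):
    """Enumerate every candidate itemset and count its support by scanning the DB."""
    items = sorted(set().union(*transactions))
    frequent = {}
    for k in range(1, len(items) + 1):                        # M = 2^d candidates overall
        for candidate in combinations(items, k):
            c = frozenset(candidate)
            count = sum(1 for t in transactions if c <= t)    # one pass over N transactions
            if count >= minsup_count:
                frequent[c] = count
    return frequent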
Computational Complexity
• Given d unique items:
• Total number of itemsets = 2^d
• Total number of possible association rules:

R = Σ_{k=1}^{d-1} [ C(d,k) × Σ_{j=1}^{d-k} C(d-k,j) ] = 3^d - 2^(d+1) + 1

If d=6, R = 602 rules

22
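
A quick numerical check of the rule-count formula (a throwaway sketch using Python's math.comb):

from math import comb

def total_rules(d):
    """R: every non-empty antecedent paired with every non-empty consequent from the remaining items."""
    return sum(comb(d, k) * sum(comb(d - k, j) for j in range(1, d - k + 1))
               for k in range(1, d))

print(total_rules(6))     # 602
print(3**6 - 2**7 + 1)    # 602, via the closed form 3^d - 2^(d+1) + 1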
Frequent Itemset Generation Strategies
• Reduce the number of candidates (M)
• Complete search: M = 2^d
• Use pruning techniques to reduce M

• Reduce the number of transactions (N)


• Reduce size of N as the size of itemset increases
• Used by DHP and vertical-based mining algorithms

• Reduce the number of comparisons (NM)


• Use efficient data structures to store the candidates or
transactions
• No need to match every candidate against every transaction

23
Reducing Number of Candidates
• Apriori principle:
• If an itemset is frequent, then all of its subsets must also be
frequent
• In other words, if an itemset is infrequent, all of its supersets
must also be infrequent

• Apriori principle holds due to the following property of the


support measure:

∀X, Y: (X ⊆ Y) ⇒ s(X) ≥ s(Y)
• Support of an itemset never exceeds the support of its subsets
• This is known as the anti-monotone property of support

24
Illustrating Apriori Principle

A B C D E

AB AC AD AE BC BD BE CD CE DE

Found to be
Infrequent
ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE

ABCD ABCE ABDE ACDE BCDE

ABCDE

(Pruned supersets)
25
Illustrating Apriori Principle
Minimum Support = 3

Items (1-itemsets):
Item    Count
Bread   4
Coke    2
Milk    4
Beer    3
Diaper  4
Eggs    1

Pairs (2-itemsets) (no need to generate candidates involving Coke or Eggs):
Itemset         Count
{Bread,Milk}    3
{Bread,Beer}    2
{Bread,Diaper}  3
{Milk,Beer}     2
{Milk,Diaper}   3
{Beer,Diaper}   3

Triplets (3-itemsets):
Itemset              Count
{Bread,Milk,Diaper}  3

If every subset is considered: C(6,1) + C(6,2) + C(6,3) = 41
With support-based pruning: 6 + 6 + 1 = 13

68% decrease in processed subsets

26
Apriori Algorithm
• Method:
• Let k=1
• Generate frequent itemsets of length 1
• Repeat until no new frequent itemsets are identified
• Generate length (k+1) candidate itemsets from length k frequent
itemsets
• Prune candidate itemsets containing subsets of length k that are
infrequent
• Count the support of each candidate by scanning the DB
• Eliminate candidates that are infrequent, leaving only those that
are frequent

27
The Apriori Algorithm (Pseudo-Code)
Ck: Candidate itemset of size k
Lk : frequent itemset of size k

L1 = {frequent items};
for (k = 1; Lk ≠ ∅; k++) do begin
Ck+1 = candidates generated from Lk;
for each transaction t in database do
increment the count of all candidates in Ck+1 that are contained in t
Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;

28
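
The pseudocode can be fleshed out into a small level-wise implementation. The sketch below follows the same structure (generate candidates from Lk, count them in one database pass, keep the frequent ones); the function name and the pair-wise join are my own choices rather than anything prescribed by the slides:

from itertools import combinations

def apriori(transactions, minsup_count):
    """Level-wise Apriori: returns a dict mapping each frequent itemset to its support count."""
    transactions = [frozenset(t) for t in transactions]

    # L1: frequent 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            counts[frozenset([item])] = counts.get(frozenset([item]), 0) + 1
    Lk = {iset: c for iset, c in counts.items() if c >= minsup_count}
    frequent = dict(Lk)

    k = 1
    while Lk:
        # Candidate generation: join frequent k-itemsets sharing k-1 items,
        # then prune candidates that have an infrequent k-subset (Apriori principle).
        prev = list(Lk)
        candidates = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                union = prev[i] | prev[j]
                if len(union) == k + 1 and all(
                        frozenset(sub) in Lk for sub in combinations(union, k)):
                    candidates.add(union)

        # Support counting: one scan of the database for this level
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1

        Lk = {c: n for c, n in counts.items() if n >= minsup_count}
        frequent.update(Lk)
        k += 1
    return frequent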
The Apriori Algorithm—An Example

Supmin = 2

Database TDB:
Tid  Items
10   A, C, D
20   B, C, E
30   A, B, C, E
40   B, E

C1 (after 1st scan):
Itemset  sup
{A}      2
{B}      3
{C}      3
{D}      1
{E}      3

L1:
Itemset  sup
{A}      2
{B}      3
{C}      3
{E}      3

C2 (generated from L1, counted in 2nd scan):
Itemset  sup
{A, B}   1
{A, C}   2
{A, E}   1
{B, C}   2
{B, E}   3
{C, E}   2

L2:
Itemset  sup
{A, C}   2
{B, C}   2
{B, E}   3
{C, E}   2

C3 (counted in 3rd scan):
Itemset    sup
{B, C, E}  2

L3:
Itemset    sup
{B, C, E}  2
29
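
As a usage note, running such an implementation on the TDB above with a minimum support count of 2 should reproduce the itemsets found in the three scans (assuming the apriori sketch given after the pseudocode slide):

TDB = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
result = apriori(TDB, minsup_count=2)
# Expected, matching the slide:
# L1: {A}:2 {B}:3 {C}:3 {E}:3          ({D}:1 is pruned)
# L2: {A,C}:2 {B,C}:2 {B,E}:3 {C,E}:2
# L3: {B,C,E}:2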
The Apriori Algorithm (Pseudo-Code)
Ck: Candidate itemset of size k
Lk : frequent itemset of size k

L1 = {frequent items};
for (k = 1; Lk ≠ ∅; k++) do begin
Ck+1 = candidates generated from Lk;
for each transaction t in database do
increment the count of all candidates in Ck+1 that are contained in t
Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;

30
Implementation of Apriori
• How to generate candidates?
• Step 1: self-joining Lk
• Step 2: pruning

31
Example of Candidates Generation
• Assume the items in Lk are listed in an order (e.g., alphabetical)
• L3 = {abc, abd, acd, ace, bcd}

• Self-joining: L3 * L3
  – abcd from abc and abd
  – acde from acd and ace

Slide from Evimaria Terzi


32
Example of Candidates Generation

• L3 = {abc, abd, acd, ace, bcd}

• Self-joining: L3 * L3
  – abcd from abc and abd
  – acde from acd and ace

• Pruning:
  – acde is removed because ade is not in L3

• C4 = {abcd}

33
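
A sketch of the self-join plus pruning step illustrated above, assuming items within each itemset can be put in a fixed (e.g., alphabetical) order; applied to L3 = {abc, abd, acd, ace, bcd} it yields C4 = {abcd}:

from itertools import combinations

def generate_candidates(Lk_minus_1):
    """F(k-1) x F(k-1) join with Apriori pruning. Lk_minus_1: set of frozensets of size k-1."""
    prev = sorted(tuple(sorted(s)) for s in Lk_minus_1)
    k_minus_1 = len(prev[0])
    candidates = set()
    for i in range(len(prev)):
        for j in range(i + 1, len(prev)):
            a, b = prev[i], prev[j]
            if a[:-1] == b[:-1]:                        # same (k-2)-item prefix
                cand = frozenset(a) | {b[-1]}
                # prune: every (k-1)-subset must be frequent
                if all(frozenset(sub) in Lk_minus_1 for sub in combinations(cand, k_minus_1)):
                    candidates.add(cand)
    return candidates

L3 = {frozenset(s) for s in ["abc", "abd", "acd", "ace", "bcd"]}
print(generate_candidates(L3))   # {frozenset({'a','b','c','d'})}; acde is pruned because ade is not in L3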
Brute-force method for generating candidates

34
F(k-1)xF(1)

35
F(k-1)xF(k-1)

36
Further Improvement of the Apriori
Method
• Major computational challenges
• Multiple scans of transaction database
• Huge number of candidates
• Tedious workload of support counting for candidates

• Improving Apriori: general ideas


• Reduce passes of transaction database scans
• Shrink number of candidates
• Facilitate support counting of candidates

37
Reducing Number of Comparisons
• Candidate counting:
• Scan the database of transactions to determine the support
of each candidate itemset
• To reduce the number of comparisons, store the candidates
in a hash structure
• Instead of matching each transaction against every candidate,
match it against candidates contained in the hashed buckets

Transactions (N of them) are matched against candidates stored in a hash tree of depth k, whose leaves are buckets of candidate itemsets:

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

38
How to Count Supports of Candidates?
• Why is counting supports of candidates a problem?
• The total number of candidates can be very huge
• One transaction may contain many candidates
• Method:
• Candidate itemsets are stored in a hash-tree
• Leaf node of hash-tree contains a list of itemsets and counts
• Interior node contains a hash table
• Subset function: finds all the candidates contained in a
transaction

39
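
The hash tree described on this and the following slides is one way to avoid matching every transaction against every candidate. A simpler alternative with the same intent, sketched below (this is not the hash-tree data structure itself), is to put the candidates in a hash set and look up only the k-subsets of each transaction:

from itertools import combinations

def count_supports(transactions, candidates, k):
    """Count support of k-item candidates: only the k-subsets of each transaction
    are looked up, instead of testing every candidate against every transaction."""
    counts = {c: 0 for c in candidates}        # candidates: iterable of frozensets of size k
    candidate_set = set(counts)
    for t in transactions:
        for subset in combinations(sorted(t), k):
            fs = frozenset(subset)
            if fs in candidate_set:
                counts[fs] += 1
    return counts

The number of lookups is C(w, k) per transaction of width w, rather than one containment test per candidate.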
Subset Operation – Support Counting
Given a transaction t = {1, 2, 3, 5, 6}, what are the possible subsets of size 3?

Level 1: fix the first item (1, 2, or 3), leaving the suffixes {2 3 5 6}, {3 5 6}, {5 6}
Level 2: fix the second item: 12_, 13_, 15_, 23_, 25_, 35_
Level 3: the size-3 subsets themselves:
123, 125, 126, 135, 136, 156, 235, 236, 256, 356
40
Generate Hash Tree
Suppose you have 15 candidate itemsets of length 3:
{1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}
You need:
• A hash function (here: items 1,4,7 / 2,5,8 / 3,6,9 hash to the three branches)
• Max leaf size: max number of itemsets stored in a leaf node (if the number of candidate itemsets exceeds the max leaf size, split the node)

(Figure: the resulting candidate hash tree, with the 15 candidates distributed over its leaves.)
41
Subset Operation Using Hash Tree
Hash Function: 1,4,7 / 2,5,8 / 3,6,9
Transaction: 1 2 3 5 6

(Figure: at the root, the transaction is split into 1+{2 3 5 6}, 2+{3 5 6}, 3+{5 6} and routed into the corresponding subtrees of the candidate hash tree.)
42
Subset Operation Using Hash Tree
Hash Function: 1,4,7 / 2,5,8 / 3,6,9
Transaction: 1 2 3 5 6

(Figure: the traversal continues one level down, expanding 1+{2 3 5 6} into 12+{3 5 6}, 13+{5 6} and 15+{6}.)
43
Subset Operation Using Hash Tree
Hash Function: 1,4,7 / 2,5,8 / 3,6,9
Transaction: 1 2 3 5 6

(Figure: the complete traversal visits only the leaves that can contain subsets of the transaction.)
44
Factors Affecting Complexity
• Choice of minimum support threshold
• lowering support threshold results in more frequent itemsets
• this may increase number of candidates and max length of frequent
itemsets
• Dimensionality (number of items) of the data set
• more space is needed to store support count of each item
• if number of frequent items also increases, both computation and I/O
costs may also increase
• Size of database
• Since Apriori makes multiple passes, run time of algorithm may increase
with number of transactions
• Average transaction width
• transaction width increases with denser data sets
• This may increase max length of frequent itemsets and traversals of hash
tree (number of subsets in a transaction increases with its width)

45
Compact Representation of Frequent
Itemsets

• Some itemsets are redundant because they have identical support as their supersets

• Number of frequent itemsets (for the example dataset in the figure) = 3 × Σ_{k=1}^{10} C(10, k)

• It is useful to identify a small representative set of itemsets from which all other
frequent itemsets can be derived

• Need a compact representation

46
Maximal Frequent Itemset
An itemset is maximal frequent if none of its immediate supersets is frequent
null

Maximal A B C D E
Itemsets

AB AC AD AE BC BD BE CD CE DE

ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE

ABCD ABCE ABDE ACDE BCDE

ABCDE

(Infrequent itemsets lie below the border)
47
Maximal Frequent Itemsets
• They form the smallest set of itemsets from which all frequent
itemsets can be derived

• Practical if an efficient algorithm exists to explicitly find the maximal


frequent itemsets without having to enumerate all their subsets

• They don’t include the support information

48
Closed Itemset
• Provide a minimal representation without losing their support
information
• An itemset is closed if none of its immediate supersets has the same
support as the itemset

49
Maximal vs Closed Itemsets
TID  Items
1    ABC
2    ABCD
3    BCE
4    ACDE
5    DE

(Figure: the itemset lattice annotated with the ids of the supporting transactions; for example, A is supported by transactions 1, 2 and 4, while ABCDE is not supported by any transaction.)

50
Maximal vs Closed Frequent Itemsets
Minimum support = 2

(Figure: the same lattice with support counts, with itemsets marked as "closed but not maximal" or "closed and maximal".)

# Closed = 9
# Maximal = 4

51
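
Given the support counts of all frequent itemsets (e.g., from an Apriori run), closed and maximal frequent itemsets can be picked out directly from the definitions above; a small sketch:

def closed_and_maximal(frequent):
    """frequent: dict mapping frozenset -> support count, for all frequent itemsets."""
    closed, maximal = set(), set()
    for X, sup in frequent.items():
        supersets = [Y for Y in frequent if X < Y]
        if all(frequent[Y] < sup for Y in supersets):   # no superset with the same support
            closed.add(X)
        if not supersets:                               # no frequent superset at all
            maximal.add(X)
    return closed, maximal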
Why are closed patterns interesting?

• Closed patterns and their frequencies alone are a sufficient representation for the frequencies of all frequent patterns

• Proof: Assume a frequent itemset X:


• X is closed ⇒ s(X) is known
• X is not closed ⇒ s(X) = max {s(Y) | Y is closed and X is a subset of Y}

53
Slide from EviMaria Terzi
Maximal vs Closed Itemsets
Frequent
Itemsets

Closed
Frequent
Itemsets

Maximal
Frequent
Itemsets

54
Alternative Algorithm – FP-growth
FP-Growth: Frequent Pattern-Growth

 FP-tree is a compressed representation of the input data

 Adopts a divide and conquer strategy

 Compress the database representing frequent items into a frequent-pattern tree, or FP-tree

 Retains the itemset association information

 If the FP-tree is small enough to fit in memory, frequent itemsets can be extracted directly in memory

56
Example: FP-Growth
 The first scan of data is the same as Apriori
 Derive the set of frequent 1-itemsets
 Let min-sup = 2
 Generate a set of ordered items

Transactional Database:
TID   List of item IDs
T100  I1, I2, I5
T200  I2, I4
T300  I2, I3
T400  I1, I2, I4
T500  I1, I3
T600  I2, I3
T700  I1, I3
T800  I1, I2, I3, I5
T900  I1, I2, I3

Item ID  Support count
I2       7
I1       6
I3       6
I4       2
I5       2

57
Construct the FP-Tree
Transactional Database:
TID   Items        TID   Items        TID   Items
T100  I1,I2,I5     T400  I1,I2,I4     T700  I1,I3
T200  I2,I4        T500  I1,I3        T800  I1,I2,I3,I5
T300  I2,I3        T600  I2,I3        T900  I1,I2,I3

- Create a branch for each transaction
- Items in each transaction are processed in (frequency) order

T100: 1- Order the items: {I2, I1, I5}
      2- Construct the first branch: <I2:1>, <I1:1>, <I5:1>

(Tree so far: null → I2:1 → I1:1 → I5:1)

58
Construct the FP-Tree
T200: 1- Order the items: {I2, I4}
      2- Construct the second branch: <I2:1>, <I4:1>

(Tree so far: null → I2:2 → I1:1 → I5:1, plus a new child I4:1 under I2)

59
Construct the FP-Tree
T300: 1- Order the items: {I2, I3}
      2- Construct the third branch: <I2:2>, <I3:1>

(Tree so far: I2:3 under the root, with children I1:1 (→ I5:1), I4:1 and I3:1)

60
Construct the FP-Tree
T400: 1- Order the items: {I2, I1, I4}
      2- Construct the fourth branch: <I2:3>, <I1:1>, <I4:1>

(Tree so far: I2:4 under the root; under I2: I1:2 (→ I5:1 and I4:1), I4:1, I3:1)
61
Construct the FP-Tree
T500: 1- Order the items: {I1, I3}
      2- Construct the fifth branch: <I1:1>, <I3:1> (a new path directly under the root, since I2 does not occur in T500)
62
Construct the FP-Tree
Final FP-tree after all nine transactions:

null
  I2:7
    I1:4
      I5:1
      I4:1
      I3:2
        I5:1
    I4:1
    I3:2
  I1:2
    I3:2

When a branch of a transaction is added, the count for each node along a common prefix is incremented by 1.
63
Construct the FP-Tree
(The FP-tree from the previous slide, together with its item header table.)

 The problem of mining frequent patterns in databases is transformed to that of mining the FP-tree

64
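
A compact sketch of the construction just described: order the items of each transaction by decreasing global support and insert them along a shared-prefix path, incrementing counts on common prefixes. The class and function names are mine, and the mining step is omitted; only the tree building and the header table (node-links) are shown:

from collections import Counter, defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent, self.count = item, parent, 0
        self.children = {}

def build_fp_tree(transactions, min_sup):
    # First scan: frequent 1-itemsets and a global item ordering
    freq = Counter(item for t in transactions for item in t)
    freq = {i: c for i, c in freq.items() if c >= min_sup}
    order = lambda item: (-freq[item], item)          # decreasing support, ties by name

    root = FPNode(None, None)
    header = defaultdict(list)                        # item -> list of nodes (node-links)
    # Second scan: insert each transaction along a shared-prefix path
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in freq), key=order):
            if item not in node.children:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            node = node.children[item]
            node.count += 1                           # counts on common prefixes are incremented
    return root, header

transactions = [
    {"I1","I2","I5"}, {"I2","I4"}, {"I2","I3"}, {"I1","I2","I4"}, {"I1","I3"},
    {"I2","I3"}, {"I1","I3"}, {"I1","I2","I3","I5"}, {"I1","I2","I3"},
]
root, header = build_fp_tree(transactions, min_sup=2)
print(root.children["I2"].count)   # 7, matching the I2:7 node in the tree above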
Construct the FP-Tree
Mining the tree for I5:
- Occurrences of I5: <I2,I1,I5> and <I2,I1,I3,I5>
- Two prefix paths: <I2,I1: 1> and <I2,I1,I3: 1>
- The conditional FP-tree contains only <I2: 2, I1: 2>; I3 is not considered because its support count of 1 is less than the minimum support count.
- Frequent patterns: {I2,I5: 2}, {I1,I5: 2}, {I2,I1,I5: 2}
65
Construct the FP-Tree
Item  Conditional Pattern Base           Conditional FP-tree
I5    {{I2,I1: 1}, {I2,I1,I3: 1}}        <I2: 2, I1: 2>
I4    {{I2,I1: 1}, {I2: 1}}              <I2: 2>
I3    {{I2,I1: 2}, {I2: 2}, {I1: 2}}     <I2: 4, I1: 2>, <I1: 2>
I1    {{I2: 4}}                          <I2: 4>
66
Construct the FP-Tree
Item  Conditional FP-tree          Frequent Patterns Generated
I5    <I2: 2, I1: 2>               {I2,I5: 2}, {I1,I5: 2}, {I2,I1,I5: 2}
I4    <I2: 2>                      {I2,I4: 2}
I3    <I2: 4, I1: 2>, <I1: 2>      {I2,I3: 4}, {I1,I3: 4}, {I2,I1,I3: 2}
I1    <I2: 4>                      {I2,I1: 4}
67
FP-growth properties

 FP-growth transforms the problem of finding long frequent patterns into searching for shorter ones recursively and then concatenating the suffix

 It uses the least frequent items as a suffix, offering good selectivity

 It reduces the search cost

 If the tree does not fit into main memory, partition the database

 Efficient and scalable for mining both long and short frequent patterns

68
Mining Association Rules
• Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup

2. Rule Generation
– Generate high confidence rules from each frequent itemset, where each rule is a binary
partitioning of a frequent itemset

69
Re-Definition: Association Rule
• Let D be a database of transactions, e.g.:

Transaction ID  Items
2000            A, B, C
1000            A, C
4000            A, D
5000            B, E, F

• Let I be the set of items that appear in the database, e.g., I = {A, B, C, D, E, F}
• A rule is defined by X → Y, where X ⊂ I, Y ⊂ I, and X ∩ Y = ∅
  – e.g.: {B,C} → {A} is a rule

70
Generating Association Rules
 Once the frequent itemsets have been found, it is straightforward to generate
strong association rules that satisfy:

 minimum Support
 minimum confidence

 Relation between support and confidence:

Confidence(A ⇒ B) = P(B|A) = support_count(A ∪ B) / support_count(A)

 support_count(A ∪ B) is the number of transactions containing the itemset A ∪ B
 support_count(A) is the number of transactions containing the itemset A.

71
Generating Association Rules

 For each frequent itemset L, generate all non-empty proper subsets of L

 For every non-empty subset S of L, output the rule:

S ⇒ (L - S)

if (support_count(L) / support_count(S)) >= min_conf

72
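
A direct sketch of this procedure, assuming a dict that maps every frequent itemset to its support count (such as the output of the earlier Apriori sketch):

from itertools import combinations

def generate_rules(frequent, min_conf):
    """frequent: dict frozenset -> support count. Yields (antecedent, consequent, confidence)."""
    for L, sup_L in frequent.items():
        if len(L) < 2:
            continue
        for r in range(1, len(L)):                  # every non-empty proper subset S of L
            for S in combinations(L, r):
                S = frozenset(S)
                conf = sup_L / frequent[S]          # support_count(L) / support_count(S)
                if conf >= min_conf:
                    yield S, L - S, conf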
Example
 Suppose the frequent itemset L = {I1, I2, I5}
 The non-empty proper subsets of L are: {I1,I2}, {I1,I5}, {I2,I5}, {I1}, {I2}, {I5}
 Association rules:
I1 ∧ I2 → I5    confidence = 2/4 = 50%
I1 ∧ I5 → I2    confidence = 2/2 = 100%
I2 ∧ I5 → I1    confidence = 2/2 = 100%
I1 → I2 ∧ I5    confidence = 2/6 = 33%
I2 → I1 ∧ I5    confidence = 2/7 = 29%
I5 → I1 ∧ I2    confidence = 2/2 = 100%

If the minimum confidence is 70%, only the rules with 100% confidence are output.

Transactional Database:
TID   List of item IDs
T100  I1, I2, I5
T200  I2, I4
T300  I2, I3
T400  I1, I2, I4
T500  I1, I3
T600  I2, I3
T700  I1, I3
T800  I1, I2, I3, I5
T900  I1, I2, I3
73
Rule Generation
• Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f → L - f satisfies the minimum confidence requirement
• If {A,B,C,D} is a frequent itemset, candidate rules:
ABC → D, ABD → C, ACD → B, BCD → A,
A → BCD, B → ACD, C → ABD, D → ABC,
AB → CD, AC → BD, AD → BC, BC → AD,
BD → AC, CD → AB

• If |L| = k, then there are 2^k - 2 candidate association rules (ignoring L → ∅ and ∅ → L)

74
Rule Generation
• How to efficiently generate rules from frequent itemsets?
• In general, confidence does not have an anti-monotone property
c(ABC → D) can be larger or smaller than c(AB → D)

• But confidence of rules generated from the same itemset has an anti-
monotone property
• e.g., L = {A,B,C,D}:

c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)


• Confidence is anti-monotone w.r.t. number of items on the RHS of the rule

75
Rule Generation for Apriori Algorithm
Lattice of rules
ABCD=>{ }
Low
Confidence
Rule
BCD=>A ACD=>B ABD=>C ABC=>D

CD=>AB BD=>AC BC=>AD AD=>BC AC=>BD AB=>CD

D=>ABC C=>ABD B=>ACD A=>BCD


Pruned
Rules

76
Rule Generation for Apriori Algorithm
• A candidate rule is generated by merging two rules that share the same prefix in the rule consequent
• join(CD=>AB, BD=>AC) would produce the candidate rule D=>ABC
• Prune rule D=>ABC if its subset AD=>BC does not have high confidence

77
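
A sketch of this level-wise variant: consequents are grown by merging, and a candidate consequent is kept only if its rule still meets the confidence threshold, exploiting the anti-monotone property from the previous slide (names and structure are my own):

def apriori_rules(frequent, min_conf):
    """Level-wise rule generation per frequent itemset, growing consequents by merging."""
    rules = []
    for L, sup_L in frequent.items():
        if len(L) < 2:
            continue
        # start with 1-item consequents that meet the confidence threshold
        consequents = [frozenset([c]) for c in L
                       if sup_L / frequent[L - frozenset([c])] >= min_conf]
        rules += [(L - c, c, sup_L / frequent[L - c]) for c in consequents]
        k = 1
        while consequents and k < len(L) - 1:
            # merge surviving consequents sharing k-1 items (analogous to candidate generation)
            merged = {a | b for a in consequents for b in consequents if len(a | b) == k + 1}
            consequents = [c for c in merged
                           if sup_L / frequent[L - c] >= min_conf]
            rules += [(L - c, c, sup_L / frequent[L - c]) for c in consequents]
            k += 1
    return rules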
Problems with the association mining
• Single minsup: It assumes that all items in the data
are of the same nature and/or have similar
frequencies.
• Not true: In many applications, some items appear
very frequently in the data, while others rarely
appear.
E.g., in a supermarket, people buy food processors and cooking pans much less frequently than they buy bread and milk.

78
Effect of Support Distribution
• Many real data sets have skewed support distribution

Support
distribution of
a retail data set

79
Rare Item Problem
• If the frequencies of items vary a great deal, we will
encounter two problems
• If minsup is set too high, those rules that involve rare items
will not be found.
• To find rules that involve both frequent and rare items,
minsup has to be set very low. This may cause
combinatorial explosion because those frequent items will
be associated with one another in all possible ways.
• Using a single minimum support threshold may not be
effective

80
Multiple minsups model

• The minimum support of a rule is expressed in


terms of minimum item supports (MIS) of the items
that appear in the rule.
• Each item can have a minimum item support.
• By providing different MIS values for different
items, the user effectively expresses different
support requirements for different rules.

81
Minsup of a rule

• Let MIS(i) be the MIS value of item i. The minsup of a


rule R is the lowest MIS value of the items in the rule.
• I.e., a rule R: a1, a2, …, ak → ak+1, …, ar satisfies its minimum support if its actual support is ≥ min(MIS(a1), MIS(a2), …, MIS(ar)).

82
An Example
• Consider the following items:
bread, shoes, clothes
The user-specified MIS values are as follows:
MIS(bread) = 2%   MIS(shoes) = 0.1%   MIS(clothes) = 0.2%
The following rule doesn't satisfy its minsup:
  clothes → bread [sup = 0.15%, conf = 70%]
The following rule satisfies its minsup:
  clothes → shoes [sup = 0.15%, conf = 70%]

83
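
A tiny sketch of the multiple-minimum-support check, using the MIS values from the example:

MIS = {"bread": 0.02, "shoes": 0.001, "clothes": 0.002}

def satisfies_minsup(rule_items, actual_support, MIS):
    """A rule satisfies its minsup if its support >= the lowest MIS among its items."""
    return actual_support >= min(MIS[i] for i in rule_items)

print(satisfies_minsup({"clothes", "bread"}, 0.0015, MIS))   # False (the rule's minsup is 0.2%)
print(satisfies_minsup({"clothes", "shoes"}, 0.0015, MIS))   # True  (the rule's minsup is 0.1%)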
Pattern Evaluation
• Association rule algorithms tend to produce too many
rules
• many of them are uninteresting or redundant
• Redundant if {A,B,C} → {D} and {A,B} → {D} have the same support & confidence

• Interestingness measures can be used to prune/rank


the derived patterns

• In the original formulation of association rules, support


& confidence are the only measures used

84
Application of Interestingness Measure
Interestingness
Measures

85
Computing Interestingness Measure
• Given a rule X → Y, the information needed to compute rule interestingness can be obtained from a contingency table

Contingency table for X → Y:

       Y     ¬Y
X      f11   f10   f1+
¬X     f01   f00   f0+
       f+1   f+0   |T|

f11: support of X and Y
f10: support of X and ¬Y
f01: support of ¬X and Y
f00: support of ¬X and ¬Y

Used to define various measures


support, confidence, lift, Gini,
J-measure, etc.

86
Drawback of Confidence

         Coffee   ¬Coffee
Tea      15       5         20
¬Tea     75       5         80
         90       10        100

Association Rule: Tea → Coffee

Confidence = P(Coffee|Tea) = 0.75

but P(Coffee) = 0.9
 Although confidence is high, the rule is misleading
 P(Coffee|¬Tea) = 0.9375
87
Statistical-based Measures
• Measures that take into account statistical dependence
Lift = P(Y|X) / P(Y)

Interest = P(X,Y) / (P(X) P(Y))

PS = P(X,Y) - P(X) P(Y)

φ-coefficient = (P(X,Y) - P(X) P(Y)) / √(P(X)[1 - P(X)] P(Y)[1 - P(Y)])
88
Example: Lift/Interest

         Coffee   ¬Coffee
Tea      15       5         20
¬Tea     75       5         80
         90       10        100

Association Rule: Tea → Coffee

Confidence = P(Coffee|Tea) = 0.75

but P(Coffee) = 0.9
 Lift = 0.75/0.9 = 0.8333 (< 1, therefore negatively associated)

89
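
The lift value can be computed straight from the contingency table; this quick sketch reproduces the 0.8333 figure:

# Contingency counts from the table: tea & coffee, tea only, coffee & no tea, neither
f11, f10, f01, f00 = 15, 5, 75, 5
N = f11 + f10 + f01 + f00                      # 100

confidence = f11 / (f11 + f10)                 # P(Coffee | Tea) = 0.75
lift = confidence / ((f11 + f01) / N)          # P(Coffee|Tea) / P(Coffee) = 0.75 / 0.9
print(round(lift, 4))                          # 0.8333 -> < 1, negative association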
Subjective Interestingness Measure
• Objective measure:
• Rank patterns based on statistics computed from data
• e.g., 21 measures of association (support, confidence, Laplace,
Gini, mutual information, Jaccard, etc).
• Subjective measure:
• Rank patterns according to user’s interpretation
• A pattern is subjectively interesting if it contradicts the
expectation of a user (Silberschatz & Tuzhilin)
• A pattern is subjectively interesting if it is actionable
(Silberschatz & Tuzhilin)

90
Interestingness via Unexpectedness
• Need to model expectation of users (domain knowledge)

Legend: + pattern expected to be frequent; - pattern expected to be infrequent; each pattern is then either found to be frequent or found to be infrequent.
When expectation and observation agree, the pattern is an expected pattern; when they disagree, it is an unexpected pattern.

• Need to combine expectation of users with evidence from data


(i.e., extracted patterns)
91
Extra
Illustration

93
Association Rule Discovery: Hash tree
Hash Function: 1,4,7 / 2,5,8 / 3,6,9

(Figure: the candidate hash tree for the 15 length-3 candidates; at this level, hash on item 1, 4 or 7.)
94
Association Rule Discovery: Hash tree
Hash Function: 1,4,7 / 2,5,8 / 3,6,9

(Figure: the same candidate hash tree; at the next level, hash on item 2, 5 or 8.)
95
Association Rule Discovery: Hash tree
Hash Function: 1,4,7 / 2,5,8 / 3,6,9

(Figure: the same candidate hash tree; at the next level, hash on item 3, 6 or 9.)
96
FP-growth Algorithm
• Use a compressed representation of the database using an FP-tree

• Once an FP-tree has been constructed, it uses a recursive divide-and-


conquer approach to mine the frequent itemsets

97
FP-tree construction

TID  Items
1    {A,B}
2    {B,C,D}
3    {A,C,D,E}
4    {A,D,E}
5    {A,B,C}
6    {A,B,C,D}
7    {B,C}
8    {A,B,C}
9    {A,B,D}
10   {B,C,E}

After reading TID=1:  null → A:1 → B:1
After reading TID=2:  null → A:1 → B:1, plus a new path null → B:1 → C:1 → D:1
98
FP-Tree Construction
TID  Items
1    {A,B}
2    {B,C,D}
3    {A,C,D,E}
4    {A,D,E}
5    {A,B,C}
6    {A,B,C,D}
7    {B,C}
8    {A,B,C}
9    {A,B,D}
10   {B,C,E}

Final FP-tree:
null
  A:7
    B:5
      C:3
        D:1
      D:1
    C:1
      D:1
        E:1
    D:1
      E:1
  B:3
    C:3
      D:1
      E:1

Header table: Item → Pointer, for each of A, B, C, D, E; the pointers (node-links) are used to assist frequent itemset generation.
99
FP-growth
Conditional Pattern base for D:
P = {(A:1,B:1,C:1),
     (A:1,B:1),
     (A:1,C:1),
     (A:1),
     (B:1,C:1)}

Recursively apply FP-growth on P

Frequent Itemsets found (with sup > 1):
AD, BD, CD, ACD, BCD

100
