Association Rule Mining
Data Mining
Pinar Duygulu
1
Why?
• Retailers now have massive databases full of transactional history
• Each transaction is simply a date and a list of items
• Is it possible to gain insights from this data?
• How are items in a database associated?
• Association Rules predict members of a set given other members in the set
Why?
• Example Rules:
• 98% of customers that purchase tires get automotive services done
• Customers who buy mustard and ketchup also buy burgers
• Goal: find these rules from just transactional data
• Rules help with: store layout, buying patterns, add-on sales, etc.
Association rule mining
• Proposed by Agrawal et al. in 1993.
• It is an important data mining model studied
extensively by the database and data mining
community.
• Assume all data are categorical.
• No good algorithm for numeric data.
• Initially used for Market Basket Analysis to find how
items purchased by customers are related.
4
The model: data
• I = {i1, i2, …, im}: a set of items.
• Transaction t:
• t is a set of items, and t ⊆ I.
• Transaction Database T: a set of transactions T = {t1,
t2, …, tn}.
5
Transaction data: supermarket data
• Market basket transactions:
t1: {bread, cheese, milk}
t2: {apple, eggs, salt, yogurt}
… …
tn: {biscuit, eggs, milk}
• Concepts:
• An item: an item/article in a basket
• I: the set of all items sold in the store
• A transaction: items purchased in a basket; it may have TID
(transaction ID)
• A transactional dataset: A set of transactions
6
Slide from Bing Liu
Transaction data: a set of documents
• A text document data set. Each document is treated as
a “bag” of keywords
doc1: Student, Teach, School
doc2: Student, School
doc3: Teach, School, City, Game
doc4: Baseball, Basketball
doc5: Basketball, Player, Spectator
doc6: Baseball, Coach, Game, Team
doc7: Basketball, Team, City, Game
Market-Basket transactions
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of Association Rules:
{Diaper} ⇒ {Beer}
{Milk, Bread} ⇒ {Eggs, Coke}
{Beer, Bread} ⇒ {Milk}

Implication means co-occurrence, not causality!
8
Applications – (1)
• Items = products; baskets = sets of products someone bought in one
trip to the store.
• Example application: given that many people buy beer and diapers
together:
• Run a sale on diapers; raise price of beer.
• Only useful if many buy diapers & beer.
9
Applications – (2)
• Baskets = sentences; items = documents containing those sentences.
• Items that appear together too often could represent plagiarism.
10
Applications – (3)
• Baskets = Web pages; items = words.
• Unusual words appearing together in a large number of documents,
e.g., “Brad” and “Angelina,” may indicate an interesting relationship.
11
Frequent Itemset
• Itemset
  • A collection of one or more items
  • Example: {Milk, Bread, Diaper}
• k-itemset
  • An itemset that contains k items
• Support count (σ)
  • Frequency of occurrence of an itemset
  • E.g. σ({Milk, Bread, Diaper}) = 2
• Support (s)
  • Fraction of transactions that contain an itemset
  • E.g. s({Milk, Bread, Diaper}) = 2/5
• Frequent Itemset
  • An itemset whose support is greater than or equal to a minsup threshold

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
Definition: Association Rule
• Association Rule
  – An implication expression of the form X ⇒ Y, where X and Y are itemsets
  – Example: {Milk, Diaper} ⇒ {Beer}
• Rule Evaluation Metrics
  – Support (s): fraction of transactions that contain both X and Y
  – Confidence (c): measures how often items in Y appear in transactions that contain X

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example: {Milk, Diaper} ⇒ {Beer}
s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4
c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 ≈ 0.67
15
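To make the two metrics concrete, here is a small Python sketch (my own, not code from the lecture) that recomputes the support and confidence of {Milk, Diaper} ⇒ {Beer} on the five transactions above:

```python
# Five market-basket transactions from the slides, as Python sets.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset):
    """sigma(itemset): number of transactions that contain the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def rule_metrics(X, Y):
    """Support and confidence of the rule X => Y."""
    s = support_count(X | Y) / len(transactions)
    c = support_count(X | Y) / support_count(X)
    return s, c

print(rule_metrics({"Milk", "Diaper"}, {"Beer"}))   # (0.4, 0.666...) = (s, c)
```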
Support and Confidence
• Support is important because
• A rule that has a low support may occur simply by chance
• A low support rule also is likely to be uninteresting from a business
perspective because it may not be profitable
• Confidence measures the reliability of the rule
16
Association Rule Mining Task
• Given a set of transactions T, the goal of association rule mining is to
find all rules having
• support ≥ minsup threshold
• confidence ≥ minconf threshold
• Brute-force approach:
• List all possible association rules
• Compute the support and confidence for each rule
• Prune rules that fail the minsup and minconf thresholds
Computationally prohibitive!
17
Mining Association Rules
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of Rules:
{Milk, Diaper} ⇒ {Beer}    (s=0.4, c=0.67)
{Milk, Beer} ⇒ {Diaper}    (s=0.4, c=1.0)
{Diaper, Beer} ⇒ {Milk}    (s=0.4, c=0.67)
{Beer} ⇒ {Milk, Diaper}    (s=0.4, c=0.67)
{Diaper} ⇒ {Milk, Beer}    (s=0.4, c=0.5)
{Milk} ⇒ {Diaper, Beer}    (s=0.4, c=0.5)
Observations:
• All the above rules are binary partitions of the same itemset:
{Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but
can have different confidence
• Thus, we may decouple the support and confidence requirements
18
Mining Association Rules
• Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup
2. Rule Generation
– Generate high confidence rules from each frequent itemset, where each rule is a binary
partitioning of a frequent itemset
19
Frequent Itemset Generation
[Figure: the itemset lattice over items A–E, from the null set up to ABCDE; every itemset in the lattice is a candidate, so there are M = 2^d candidates for d items. Each of the N transactions below (maximum width w) is matched against the list of M candidates.]
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
• Match each transaction against every candidate
• Complexity ~ O(NMw) ⇒ expensive, since M = 2^d !!!
21
Computational Complexity
• Given d unique items:
  • Total number of itemsets = 2^d
  • Total number of possible association rules:
    R = Σ_{k=1}^{d-1} [ C(d, k) × Σ_{j=1}^{d-k} C(d-k, j) ] = 3^d − 2^(d+1) + 1
22
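As a quick sanity check on the closed form, the following sketch (not part of the slides) enumerates every rule X ⇒ Y with disjoint, non-empty X and Y for small d and compares the count against 3^d − 2^(d+1) + 1:

```python
from itertools import combinations

def rule_count_brute(d):
    """Count rules X => Y with X, Y non-empty, disjoint subsets of d items."""
    items = range(d)
    count = 0
    for k in range(1, d):                        # size of the antecedent X
        for X in combinations(items, k):
            rest = [i for i in items if i not in X]
            for j in range(1, len(rest) + 1):    # size of the consequent Y
                count += sum(1 for _ in combinations(rest, j))
    return count

for d in range(2, 7):
    closed_form = 3**d - 2**(d + 1) + 1
    assert rule_count_brute(d) == closed_form
    print(d, closed_form)    # e.g. d = 6 gives 602 possible rules
```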
Frequent Itemset Generation Strategies
• Reduce the number of candidates (M)
• Complete search: M = 2^d
• Use pruning techniques to reduce M
23
Reducing Number of Candidates
• Apriori principle:
• If an itemset is frequent, then all of its subsets must also be
frequent
• In other words, if an itemset is infrequent, all of its supersets
must also be infrequent
∀ X, Y: (X ⊆ Y) ⇒ s(X) ≥ s(Y)
• Support of an itemset never exceeds the support of its subsets
• This is known as the anti-monotone property of support
24
Illustrating Apriori Principle
[Itemset lattice from null to ABCDE: once an itemset (e.g., AB) is found to be infrequent, all of its supersets are pruned from the search.]
25
Illustrating Apriori Principle
Minimum Support = 3

Items (1-itemsets):
Item     Count
Bread    4
Coke     2
Milk     4
Beer     3
Diaper   4
Eggs     1

Pairs (2-itemsets) (no need to generate candidates involving Coke or Eggs):
Itemset          Count
{Bread,Milk}     3
{Bread,Beer}     2
{Bread,Diaper}   3
{Milk,Beer}      2
{Milk,Diaper}    3
{Beer,Diaper}    3

Triplets (3-itemsets)
26
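A minimal sketch (my own, assuming the five transactions above and minimum support 3) that reproduces these counts and shows why candidates containing Coke or Eggs never need to be generated:

```python
from itertools import combinations
from collections import Counter

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
minsup = 3

# 1-itemset counts: Coke (2) and Eggs (1) fall below minsup and are dropped.
item_counts = Counter(item for t in transactions for item in t)
frequent_items = {i for i, c in item_counts.items() if c >= minsup}
print(frequent_items)

# 2-itemset candidates are built only from frequent items (Apriori principle).
pair_counts = Counter()
for t in transactions:
    for pair in combinations(sorted(t & frequent_items), 2):
        pair_counts[pair] += 1
frequent_pairs = {p: c for p, c in pair_counts.items() if c >= minsup}
print(frequent_pairs)   # Bread-Milk, Bread-Diaper, Milk-Diaper, Beer-Diaper, each with count 3
```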
Apriori Algorithm
• Method:
• Let k=1
• Generate frequent itemsets of length 1
• Repeat until no new frequent itemsets are identified
• Generate length (k+1) candidate itemsets from length k frequent
itemsets
• Prune candidate itemsets containing subsets of length k that are
infrequent
• Count the support of each candidate by scanning the DB
• Eliminate candidates that are infrequent, leaving only those that
are frequent
27
The Apriori Algorithm (Pseudo-Code)
Ck: Candidate itemset of size k
Lk : frequent itemset of size k
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
28
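The pseudo-code maps fairly directly onto Python. The sketch below is my own (not the lecture's implementation); it represents itemsets as frozensets, joins frequent k-itemsets, prunes candidates with an infrequent subset, and counts support with one database scan per level:

```python
from itertools import combinations

def apriori(transactions, minsup_count):
    """Return {itemset: support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    # L1: frequent 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Lk = {s: c for s, c in counts.items() if c >= minsup_count}
    all_frequent = dict(Lk)

    k = 1
    while Lk:
        # Candidate generation: join frequent k-itemsets, then prune by subsets.
        prev = list(Lk)
        candidates = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                union = prev[i] | prev[j]
                if len(union) == k + 1 and all(
                    frozenset(sub) in Lk for sub in combinations(union, k)
                ):
                    candidates.add(union)

        # Support counting by a single scan of the database.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        Lk = {s: c for s, c in counts.items() if c >= minsup_count}
        all_frequent.update(Lk)
        k += 1
    return all_frequent
```

It can be run on the market-basket transactions above, e.g. apriori(transactions, minsup_count=3).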
The Apriori Algorithm—An Example
30
Implementation of Apriori
• How to generate candidates?
• Step 1: self-joining Lk
• Step 2: pruning
31
Example of Candidates Generation
• Assume the items in Lk are listed in an order (e.g., alphabetical)
• L3 = {abc, abd, acd, ace, bcd}
• Self-joining: L3*L3
  – abcd from abc and abd
  – acde from acd and ace
• Pruning: acde is removed because its subset ade is not in L3
• C4 = {abcd}
33
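A small sketch (mine, not from the slides) of this ordered self-join on L3: two k-itemsets are joined when they share their first k−1 items, and a candidate is pruned if any of its k-subsets is infrequent:

```python
from itertools import combinations

L3 = [("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"), ("a", "c", "e"), ("b", "c", "d")]

def generate_candidates(Lk):
    """F(k-1) x F(k-1) join: merge itemsets sharing their first k-1 items, then prune."""
    Lk = sorted(Lk)
    frequent = set(Lk)
    k = len(Lk[0])
    candidates = []
    for i in range(len(Lk)):
        for j in range(i + 1, len(Lk)):
            if Lk[i][:-1] == Lk[j][:-1]:              # same (k-1)-prefix
                cand = Lk[i] + (Lk[j][-1],)           # e.g. abc + abd -> abcd
                # prune: every k-subset of the candidate must be frequent
                if all(sub in frequent for sub in combinations(cand, k)):
                    candidates.append(cand)
    return candidates

print(generate_candidates(L3))
# [('a', 'b', 'c', 'd')] -- acde is generated by the join but pruned, since ade is not in L3
```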
Brute-force method for generating candidates
34
F(k-1) × F(1)
35
F(k-1) × F(k-1)
36
Further Improvement of the Apriori Method
• Major computational challenges
• Multiple scans of transaction database
• Huge number of candidates
• Tedious workload of support counting for candidates
37
Reducing Number of Comparisons
• Candidate counting:
• Scan the database of transactions to determine the support
of each candidate itemset
• To reduce the number of comparisons, store the candidates
in a hash structure
• Instead of matching each transaction against every candidate,
match it against candidates contained in the hashed buckets
39
Subset Operation – Support Counting
Given a transaction t, what are
the possible subsets of size 3?
Transaction t = {1, 2, 3, 5, 6}
[Figure: level-by-level enumeration of the 3-subsets of t (123, 125, 126, 135, 136, 156, 235, 236, 256, 356) and the candidate hash tree used to locate the buckets they can fall into.]
42
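As a rough sketch of the subset operation (assuming the transaction t = {1, 2, 3, 5, 6} and the 1,4,7 / 2,5,8 / 3,6,9 hash function from the figure), the only candidates a transaction can support are its own 3-subsets, and hashing an item selects a branch of the hash tree:

```python
from itertools import combinations

t = (1, 2, 3, 5, 6)

# All 3-subsets of the transaction: the only candidates t can possibly support.
subsets = list(combinations(t, 3))
print(len(subsets), subsets)          # 10 subsets: (1,2,3), (1,2,5), ..., (3,5,6)

def h(item):
    """Hash function from the figure: 1,4,7 -> branch 0; 2,5,8 -> 1; 3,6,9 -> 2."""
    return (item - 1) % 3

# Hashing on the first item of each subset tells which root branch to descend.
for s in subsets:
    print(s, "-> branch", h(s[0]))
```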
Subset Operation Using Hash Tree
Hash Function: 1,4,7 / 2,5,8 / 3,6,9
Transaction: 1 2 3 5 6
[Figure: the transaction is split into partial paths 1+{2 3 5 6}, 2+{3 5 6}, 3+{5 6}; 1+{2 3 5 6} is further split into 12+{3 5 6}, 13+{5 6}, 15+{6}; each partial path is hashed down the candidate hash tree.]
43
Subset Operation Using Hash Tree
Hash Function: 1,4,7 / 2,5,8 / 3,6,9
Transaction: 1 2 3 5 6
[Figure: continuing the descent reaches the leaf buckets whose candidates are compared against the transaction.]
Match transaction against 11 out of 15 candidates
44
Factors Affecting Complexity
• Choice of minimum support threshold
• lowering support threshold results in more frequent itemsets
• this may increase number of candidates and max length of frequent
itemsets
• Dimensionality (number of items) of the data set
• more space is needed to store support count of each item
• if number of frequent items also increases, both computation and I/O
costs may also increase
• Size of database
• Since Apriori makes multiple passes, run time of algorithm may increase
with number of transactions
• Average transaction width
• transaction width increases with denser data sets
• This may increase max length of frequent itemsets and traversals of hash
tree (number of subsets in a transaction increases with its width)
45
Compact Representation of Frequent Itemsets
• Some itemsets are redundant because they have identical support as their supersets
  [Example figure: a dataset for which the number of frequent itemsets is 3 × Σ_{k=1}^{10} C(10, k)]
• It is useful to identify a small representative set of itemsets from which all other
frequent itemsets can be derived
46
Maximal Frequent Itemset
An itemset is maximal frequent if none of its immediate supersets is frequent
[Itemset lattice over A–E: the border separates frequent from infrequent itemsets; the maximal frequent itemsets are the frequent itemsets immediately below the border.]
47
Maximal Frequent Itemsets
• They form the smallest set of itemsets from which all frequent
itemsets can be derived
48
Closed Itemset
• Provide a minimal representation of itemsets without losing support information
• An itemset is closed if none of its immediate supersets has the same
support as the itemset
49
Maximal vs Closed Itemsets
TID  Items
1    ABC
2    ABCD
3    BCE
4    ACDE
5    DE
[Itemset lattice over A–E, each itemset annotated with the IDs of the transactions that contain it (e.g., A: 1,2,4; AB: 1,2; ABC: 1,2); itemsets such as ABCDE are not supported by any transaction.]
50
Maximal vs Closed Frequent Itemsets
Minimum support = 2
[Same lattice with support counts: e.g., ABC and ACD are closed and maximal; AC and BC are closed but not maximal, because they have frequent supersets.]
# Closed = 9
# Maximal = 4
51
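The counts above can be re-derived by brute force. The following sketch (mine, using the slide's minimum support of 2 and the five transactions ABC, ABCD, BCE, ACDE, DE) reproduces # Closed = 9 and # Maximal = 4:

```python
from itertools import combinations

transactions = [set("ABC"), set("ABCD"), set("BCE"), set("ACDE"), set("DE")]
minsup = 2
items = sorted(set().union(*transactions))

def sup(itemset):
    """Support count: number of transactions containing the itemset."""
    return sum(1 for t in transactions if set(itemset) <= t)

# All frequent itemsets, by brute-force enumeration of the lattice.
frequent = {frozenset(c): sup(c)
            for k in range(1, len(items) + 1)
            for c in combinations(items, k)
            if sup(c) >= minsup}

def immediate_supersets(s):
    return [s | {i} for i in items if i not in s]

# Closed: no immediate superset has the same support.
closed = [s for s in frequent
          if all(sup(u) < frequent[s] for u in immediate_supersets(s))]
# Maximal: no immediate superset is frequent.
maximal = [s for s in frequent
           if all(u not in frequent for u in immediate_supersets(s))]

print(len(frequent), len(closed), len(maximal))   # 14 frequent, 9 closed, 4 maximal
```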
Why are closed patterns interesting?
53
Slide from Evimaria Terzi
Maximal vs Closed Itemsets
[Venn diagram: Maximal Frequent Itemsets ⊆ Closed Frequent Itemsets ⊆ Frequent Itemsets]
54
Alternative Algorithm – FP-Growth
FP-Growth: Frequent Pattern Growth
If the FP-tree is small enough to fit in memory, frequent itemsets can be extracted directly in memory
56
Example: FP-Growth
• The first scan of the data is the same as Apriori: derive the set of frequent 1-itemsets
• Let min-sup = 2
• Generate a list of items ordered by support count

Transactional Database
TID   List of item IDs
T100  I1, I2, I5
T200  I2, I4
T300  I2, I3
T400  I1, I2, I4
T500  I1, I3
T600  I2, I3
T700  I1, I3
T800  I1, I2, I3, I5
T900  I1, I2, I3

Item ID   Support count
I2        7
I1        6
I3        6
I4        2
I5        2
57
Construct the FP-Tree
Transactional Database
TID Items TID Items TID Items
T100 I1,I2,I5 T400 I1,I2,I4 T700 I1,I3
T200 I2,I4 T500 I1,I3 T800 I1,I2,I3,I5
T300 I2,I3 T600 I2,I3 T900 I1,I2,I3
58
Construct the FP-Tree
Item ID   Support count
I2        7
I1        6
I3        6
I4        2
I5        2

Resulting FP-tree (each transaction is inserted with its frequent items in support order, sharing common prefixes):
null
  I2:7
    I1:4
      I5:1
      I4:1
      I3:2
        I5:1
    I3:2
    I4:1
  I1:2
    I3:2

When a branch of a transaction is added, the count for each node along a common prefix is incremented by 1
63
FP-growth properties
If the tree does not fit into main memory, partition the database
Efficient and scalable for mining both long and short frequent patterns
68
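As a rough illustration of the construction just described (my own sketch, not code from the lecture; it assumes min-sup = 2 and the nine-transaction database above), each transaction is inserted with its frequent items ordered by decreasing support, sharing prefix paths:

```python
class FPNode:
    """A node of the FP-tree: an item, its count, and its children."""
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, minsup):
    # First scan: global support counts.
    counts = {}
    for t in transactions:
        for item in t:
            counts[item] = counts.get(item, 0) + 1
    frequent = {i: c for i, c in counts.items() if c >= minsup}

    # Second scan: insert each transaction along a shared prefix path.
    root = FPNode(None, None)
    for t in transactions:
        ordered = sorted((i for i in t if i in frequent),
                         key=lambda i: (-frequent[i], i))
        node = root
        for item in ordered:
            child = node.children.setdefault(item, FPNode(item, node))
            child.count += 1        # count along the common prefix is incremented
            node = child
    return root

db = [["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"], ["I1", "I2", "I4"],
      ["I1", "I3"], ["I2", "I3"], ["I1", "I3"], ["I1", "I2", "I3", "I5"],
      ["I1", "I2", "I3"]]
tree = build_fp_tree(db, minsup=2)
print({item: node.count for item, node in tree.children.items()})   # {'I2': 7, 'I1': 2}
```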
Mining Association Rules
• Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup
2. Rule Generation
– Generate high confidence rules from each frequent itemset, where each rule is a binary
partitioning of a frequent itemset
69
Re-Definition: Association Rule
Let D be a database of transactions, e.g.:
Transaction ID   Items
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F
70
Generating Association Rules
Once the frequent itemsets have been found, it is straightforward to generate
strong association rules that satisfy:
minimum Support
minimum confidence
Confidence(A ⇒ B) = P(B|A) = support_count(A ∪ B) / support_count(A)
71
Generating Association Rules
• For each frequent itemset L, generate all non-empty subsets S of L
• For every non-empty subset S, output the rule S ⇒ (L − S) if support_count(L) / support_count(S) ≥ minimum confidence
72
Example
Suppose the frequent itemset L = {I1, I2, I5}
Non-empty proper subsets of L: {I1,I2}, {I1,I5}, {I2,I5}, {I1}, {I2}, {I5}

Transactional Database
TID   List of item IDs
T100  I1, I2, I5
T200  I2, I4
T300  I2, I3
T400  I1, I2, I4
T500  I1, I3
T600  I2, I3
T700  I1, I3
T800  I1, I2, I3, I5
T900  I1, I2, I3

Association rules:
I1 ∧ I2 ⇒ I5   confidence = 2/4 = 50%
I1 ∧ I5 ⇒ I2   confidence = 2/2 = 100%
I2 ∧ I5 ⇒ I1   confidence = 2/2 = 100%
I1 ⇒ I2 ∧ I5   confidence = 2/6 = 33%
I2 ⇒ I1 ∧ I5   confidence = 2/7 = 29%
I5 ⇒ I1 ∧ I2   confidence = 2/2 = 100%

If the minimum confidence = 70%, only the rules with 100% confidence are output as strong rules
73
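A short sketch (mine, not from the slides) that reproduces these confidences by enumerating the non-empty proper subsets of L = {I1, I2, I5} over the database above:

```python
from itertools import combinations

db = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"},
      {"I1", "I3"}, {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"},
      {"I1", "I2", "I3"}]

def sup_count(itemset):
    return sum(1 for t in db if itemset <= t)

L = frozenset({"I1", "I2", "I5"})
min_conf = 0.70

# Every non-empty proper subset S of L yields one candidate rule S => (L - S).
for k in range(1, len(L)):
    for antecedent in combinations(sorted(L), k):
        S = frozenset(antecedent)
        conf = sup_count(L) / sup_count(S)
        verdict = "strong" if conf >= min_conf else "rejected"
        print(sorted(S), "=>", sorted(L - S), f"confidence = {conf:.0%} ({verdict})")
```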
Rule Generation
• Given a frequent itemset L, find all non-empty subsets f ⊂ L such that the rule f ⇒ L − f satisfies the minimum confidence requirement
• If {A,B,C,D} is a frequent itemset, candidate rules:
  ABC ⇒ D, ABD ⇒ C, ACD ⇒ B, BCD ⇒ A,
  A ⇒ BCD, B ⇒ ACD, C ⇒ ABD, D ⇒ ABC,
  AB ⇒ CD, AC ⇒ BD, AD ⇒ BC, BC ⇒ AD,
  BD ⇒ AC, CD ⇒ AB
74
Rule Generation
• How to efficiently generate rules from frequent itemsets?
• In general, confidence does not have an anti-monotone property
c(ABC ⇒ D) can be larger or smaller than c(AB ⇒ D)
• But confidence of rules generated from the same itemset has an anti-monotone property
• e.g., L = {A,B,C,D}: c(ABC ⇒ D) ≥ c(AB ⇒ CD) ≥ c(A ⇒ BCD)
• Confidence is anti-monotone with respect to the number of items on the right-hand side of the rule
75
Rule Generation for Apriori Algorithm
Lattice of rules for the frequent itemset {A,B,C,D}: ABCD ⇒ {} at the top, then BCD ⇒ A, ACD ⇒ B, ABD ⇒ C, ABC ⇒ D, and so on down to single-item antecedents
[Figure: once a rule in the lattice is found to have low confidence, all rules whose consequent is a superset of its consequent can be pruned.]
76
Rule Generation for Apriori Algorithm
• A candidate rule is generated by merging two rules that share the same prefix in the rule consequent
  • e.g., join(CD ⇒ AB, BD ⇒ AC) would produce the candidate rule D ⇒ ABC
• Prune rule D ⇒ ABC if its subset AD ⇒ BC does not have high confidence
77
Problems with the association mining
• Single minsup: It assumes that all items in the data
are of the same nature and/or have similar
frequencies.
• Not true: In many applications, some items appear
very frequently in the data, while others rarely
appear.
E.g., in a supermarket, people buy food processors and cooking pans much less frequently than they buy bread and milk.
78
Effect of Support Distribution
• Many real data sets have skewed support distribution
[Figure: support distribution of a retail data set]
79
Rare Item Problem
• If the frequencies of items vary a great deal, we will
encounter two problems
• If minsup is set too high, those rules that involve rare items
will not be found.
• To find rules that involve both frequent and rare items,
minsup has to be set very low. This may cause
combinatorial explosion because those frequent items will
be associated with one another in all possible ways.
• Using a single minimum support threshold may not be
effective
80
Multiple minsups model
• Each item can have its own user-specified minimum item support (MIS)
81
Minsup of a rule
• The minsup of a rule is the lowest MIS value among the items in the rule
82
An Example
• Consider the following items:
bread, shoes, clothes
The user-specified MIS values are as follows:
MIS(bread) = 2% MIS(shoes) = 0.1%
MIS(clothes) = 0.2%
The following rule doesn’t satisfy its minsup:
clothes ⇒ bread [sup = 0.15%, conf = 70%]
The following rule satisfies its minsup:
clothes ⇒ shoes [sup = 0.15%, conf = 70%]
83
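A minimal sketch of this check (treat it as an illustrative helper, not code from the lecture), where a rule's minsup is taken as the lowest MIS among its items:

```python
# User-specified per-item MIS thresholds, as fractions.
MIS = {"bread": 0.02, "shoes": 0.001, "clothes": 0.002}

def satisfies_minsup(rule_items, support):
    """A rule's minsup is the smallest MIS of the items it contains."""
    return support >= min(MIS[item] for item in rule_items)

print(satisfies_minsup({"clothes", "bread"}, 0.0015))   # False: 0.15% < MIS(clothes) = 0.2%
print(satisfies_minsup({"clothes", "shoes"}, 0.0015))   # True:  0.15% >= MIS(shoes) = 0.1%
```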
Pattern Evaluation
• Association rule algorithms tend to produce too many
rules
• many of them are uninteresting or redundant
• Redundant if {A,B,C} ⇒ {D} and {A,B} ⇒ {D} have the same support & confidence
84
Application of Interestingness Measure
Interestingness
Measures
85
Computing Interestingness Measure
• Given a rule X ⇒ Y, the information needed to compute rule interestingness can be obtained from a contingency table
86
Drawback of Confidence
          Coffee   No Coffee   Total
Tea         15          5        20
No Tea      75          5        80
Total       90         10       100

Confidence(Tea ⇒ Coffee) = 15/20 = 0.75 looks high, yet P(Coffee) = 0.9: knowing that a person drinks tea actually lowers the chance that the person drinks coffee.
87
Statistical-based Measures
• Measures that take into account statistical dependence
Lift = P(Y|X) / P(Y)

Interest = P(X,Y) / ( P(X) P(Y) )

PS = P(X,Y) − P(X) P(Y)

φ-coefficient = ( P(X,Y) − P(X) P(Y) ) / sqrt( P(X) [1 − P(X)] P(Y) [1 − P(Y)] )
88
Example: Lift/Interest
          Coffee   No Coffee   Total
Tea         15          5        20
No Tea      75          5        80
Total       90         10       100

Lift(Tea ⇒ Coffee) = P(Coffee|Tea) / P(Coffee) = 0.75 / 0.9 ≈ 0.83 < 1, so Tea and Coffee are negatively associated despite the seemingly high confidence.
89
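A quick sketch (values taken from the contingency table above) computing confidence and lift for Tea ⇒ Coffee:

```python
# Counts from the Tea/Coffee contingency table.
n = 100
tea_and_coffee = 15
tea = 20
coffee = 90

p_coffee = coffee / n                   # P(Coffee) = 0.9
confidence = tea_and_coffee / tea       # P(Coffee | Tea) = 0.75
lift = confidence / p_coffee            # 0.75 / 0.9 ~= 0.833

print(confidence, lift)   # lift < 1: Tea and Coffee are negatively associated,
                          # even though the confidence looks high
```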
Subjective Interestingness Measure
• Objective measure:
• Rank patterns based on statistics computed from data
• e.g., 21 measures of association (support, confidence, Laplace,
Gini, mutual information, Jaccard, etc).
• Subjective measure:
• Rank patterns according to user’s interpretation
• A pattern is subjectively interesting if it contradicts the
expectation of a user (Silberschatz & Tuzhilin)
• A pattern is subjectively interesting if it is actionable
(Silberschatz & Tuzhilin)
90
Interestingness via Unexpectedness
• Need to model expectation of users (domain knowledge)
[Diagram: '+' marks patterns expected to be frequent, '-' patterns expected to be infrequent; patterns that agree with the expectation are expected patterns, patterns that contradict it are unexpected patterns]
93
Association Rule Discovery: Hash tree
Hash Function: items 1, 4, 7 hash to the left branch; 2, 5, 8 to the middle; 3, 6, 9 to the right
Candidate Hash Tree over 15 candidate 3-itemsets:
{2,3,4} {5,6,7} {1,4,5} {1,3,6} {3,4,5} {3,5,6} {3,6,7} {3,5,7} {3,6,8} {1,2,4} {1,5,9} {6,8,9} {1,2,5} {4,5,7} {4,5,8}
[Figure: hashing on 1, 4 or 7 at the root leads to the leftmost subtree of the candidate hash tree]
94
Association Rule Discovery: Hash tree
[Figure: the same candidate hash tree; hashing on 2, 5 or 8 at the root leads to the middle subtree]
95
Association Rule Discovery: Hash tree
[Figure: the same candidate hash tree; hashing on 3, 6 or 9 at the root leads to the rightmost subtree]
96
FP-growth Algorithm
• Use a compressed representation of the database using an FP-tree
97
FP-tree construction
TID  Items
1    {A,B}
2    {B,C,D}
3    {A,C,D,E}
4    {A,D,E}
5    {A,B,C}
6    {A,B,C,D}
7    {B,C}
8    {A,B,C}
9    {A,B,D}
10   {B,C,E}

After reading TID=1:
null
  A:1
    B:1

After reading TID=2:
null
  A:1
    B:1
  B:1
    C:1
      D:1
98
FP-Tree Construction
Transaction Database
TID  Items
1    {A,B}
2    {B,C,D}
3    {A,C,D,E}
4    {A,D,E}
5    {A,B,C}
6    {A,B,C,D}
7    {B,C}
8    {A,B,C}
9    {A,B,D}
10   {B,C,E}

FP-tree after all ten transactions:
null
  A:7
    B:5
      C:3
        D:1
      D:1
    C:1
      D:1
        E:1
    D:1
      E:1
  B:3
    C:3
      D:1
      E:1
99
FP-growth
[Figure: the paths of the FP-tree ending in D]
Conditional Pattern base for D:
P = { (A:1, B:1, C:1), (A:1, B:1), (A:1, C:1), (A:1), (B:1, C:1) }
Recursively apply FP-growth on P
Frequent Itemsets found (with sup > 1): AD, BD, CD, ACD, BCD
100