Mod 4 Part1 - Merged

Module 4 focuses on Association Rule Analysis, covering the introduction to association rules, methods for discovering them, and advanced algorithms like Apriori and FP-tree. It emphasizes the significance of association rule mining in market basket analysis, including identifying product relationships and optimizing marketing strategies. Key concepts such as support, confidence, and frequent itemsets are defined, alongside practical applications in various fields.

Module 4: (Association Rule Analysis)

▪ Part - 1
▪ Association Rules-Introduction
▪ Methods to discover Association rules
▪ Apriori(Level-wise algorithm)
Visualization:
https://athena.ecs.csus.edu/~mei/associationcw/Association.html
▪ Part - 2 (Advanced Frequent Itemset Mining Algorithms)
▪ Partition Algorithm
▪ FP-tree Growth Algorithm.
▪ Pincer Search Algorithm,
▪ Dynamic Itemset Counting Algorithm

1
KTU
4.1.1. Discuss the significance of association rule mining in market
basket analysis. (3)
4.1.2. Define support, confidence, and frequent itemset, in association
rule mining context. (3)

2
What Is Frequent Pattern Analysis?
◼ Frequent pattern: a pattern (a set of items, subsequences,
substructures, etc.) that occurs frequently in a data set
◼ Motivation: Finding inherent regularities in data
◼ What products were often purchased together?
◼ What are the subsequent purchases after buying a PC?
◼ What kinds of DNA are sensitive to this new drug?
◼ Can we automatically classify web documents?
◼ Applications
◼ market basket analysis, cross-marketing, catalog design, sale
campaign analysis, web log (click stream) analysis, and DNA
sequence analysis, plagiarism check

3
Market Basket
TID List of Items
101 Litchi, Hill Banana, Strawberry
102 Litchi, Passion Fruit
103 Passion Fruit, Tomato
104 Litchi, Hill Banana, Strawberry
105 Pears, Strawberry
106 Pears
107 Pears, Passion Fruit
108 Litchi, Hill Banana, Watermelon, Strawberry
109 Watermelon, Tomato
110 Litchi, Hill Banana

4
Market Basket

5
Significance of Association Rule Mining in Market
Basket Analysis

◼ Identifying Product Relationships: Helps retailers discover


which products are often bought together (e.g., bread and
butter, milk and cereal).
◼ Personalized Recommendations: E-commerce platforms use
association rules to suggest related products to customers,
enhancing user experience.
◼ Optimizing Store Layouts: Supermarkets can arrange items
strategically to boost cross-selling opportunities.
◼ Enhancing Marketing Strategies: Helps in targeted promotions,
bundling offers, and discount strategies.
◼ Reducing Inventory Costs: Understanding item correlations
helps in better stock management and demand forecasting.

6
Association Rule Mining – Support

Transaction Database:
TID   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

• I = {i1, i2, ..., in}: the set of all items
• Transaction t: a set of items such that t ⊆ I
• Transaction Database D: a set of transactions
• A: a set of items, A ⊆ I
• Support of itemset A:
  • Frequency of itemset A in D
  • The probability that a transaction contains A
  • S = n(A) / n(D)

Obtain the 1-item itemsets and their support counts:
Itemsets   Support
{A}        2
{B}        3
{C}        3
{D}        1
{E}        3

Support Example
Ten transactions (one per line):
3, 5, 8
2, 6, 8
1, 4, 7, 10
3, 8, 10
2, 5, 8
1, 5, 6
4, 5, 6, 8
2, 3, 4
1, 5, 7, 8
3, 8, 9, 10

Support Count (8) = 7
Support of (8) = n(8) / n(D) = 7 / 10
Association Rule Mining – Confidence
(using the same ten transactions as in the Support Example above)

◼ Support count of A = n(A)
◼ An itemset A is frequent if A’s support count is >= a min-sup threshold
◼ An Association Rule is an implication of the form A => B, where A, B ⊆ I
◼ Confidence (A => B) = P(B|A) = Support-Count(A U B) / Support-Count(A)

Example: Confidence ({8} => {5}) = Sup Count (8 U 5) / Sup Count (8) = 4 / 7

10
Support and Confidence Example
(same ten transactions as above)

What is the confidence of the association rule {5} => {8}?
• Sup Count (5 U 8) = 4
• Sup Count (5) = 5
• Confidence ({5} => {8}) = Sup Count (5 U 8) / Sup Count (5) = 4/5 = 0.8
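The numbers above are easy to verify programmatically. A minimal Python sketch (not part of the original slides), using the ten transactions listed in the Support Example:

```python
# Support and confidence for the ten-transaction example above (illustrative sketch).
transactions = [
    {3, 5, 8}, {2, 6, 8}, {1, 4, 7, 10}, {3, 8, 10}, {2, 5, 8},
    {1, 5, 6}, {4, 5, 6, 8}, {2, 3, 4}, {1, 5, 7, 8}, {3, 8, 9, 10},
]

def support_count(itemset):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def confidence(antecedent, consequent):
    """Confidence of (antecedent => consequent) = n(A U B) / n(A)."""
    return support_count(antecedent | consequent) / support_count(antecedent)

print(support_count({8}))                        # 7
print(support_count({8}) / len(transactions))    # 0.7, the support of {8}
print(confidence({5}, {8}))                      # 0.8
print(confidence({8}, {5}))                      # 4/7 ~= 0.571
```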
The Apriori Pruning Principle
(and the Downward Closure Property)
◼ The downward closure property of frequent patterns
◼ The downward closure property of frequent patterns states
that any subset of a frequent itemset must also be frequent.
◼ For example, if the itemset {bread, eggs, coffee} is frequent,
then its immediate subsets—{bread, eggs}, {bread, coffee},
and {eggs, coffee}—must also be frequent.
◼ Apriori pruning principle
◼ If an itemset is infrequent, all of its supersets will also be
infrequent and these supersets should not be generated.

12
The Apriori Algorithm
1. Initial step: Set k = 1 . Scan the transaction database once
to obtain frequent 1-item itemset list ‘L1’.
2. Self-join Lk: We can join 2 itemsets if they have k-1
common items and they differ in one item
3. Prune: Select itemsets for which all the immediate subsets
are frequent. Add them to Ck+1, the candidate itemset list
4. Support count: Scan the transaction database and count
the support for each itemset in Ck+1
5. Select the frequent itemsets: Choose itemsets from Ck+1
whose support count meets or exceeds the minimum support
threshold. These form the next frequent itemset Lk+1
6. Iterate: Set k = k+1. Repeat steps 2 to 5 until Lk is empty (a minimal code sketch of this loop follows below).
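As a rough illustration of steps 1–6 (a sketch of my own, not the slides' pseudocode), in Python:

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Level-wise Apriori sketch: returns all frequent itemsets (as frozensets)."""
    transactions = [frozenset(t) for t in transactions]

    def count(itemset):                      # support counting by a database scan
        return sum(1 for t in transactions if itemset <= t)

    # Step 1: frequent 1-item itemsets L1
    items = {i for t in transactions for i in t}
    Lk = {frozenset([i]) for i in items if count(frozenset([i])) >= min_sup}
    frequent, k = set(Lk), 1
    while Lk:
        # Step 2: self-join - combine two k-itemsets that share k-1 items
        joined = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        # Step 3: prune - keep candidates whose k-item subsets are all frequent
        Ck1 = {c for c in joined
               if all(frozenset(s) in Lk for s in combinations(c, k))}
        # Steps 4-5: count support and select the next frequent itemsets L(k+1)
        Lk = {c for c in Ck1 if count(c) >= min_sup}
        frequent |= Lk
        k += 1                               # Step 6: iterate
    return frequent

# The four-transaction database used in the illustration below, min support 2
db = [{'A', 'C', 'D'}, {'B', 'C', 'E'}, {'A', 'B', 'C', 'E'}, {'B', 'E'}]
print(sorted(map(sorted, apriori(db, 2))))
# Frequent itemsets: A, B, C, E, {A,C}, {B,C}, {B,E}, {C,E}, {B,C,E}
```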

13
Example of Generation of
Candidate 3-Item Itemsets from L2
◼ Given Frequent 2-Item Itemsets ​𝐿2 =
{{𝐴,𝐵},{𝐴,𝐶},{𝐴,𝐷},{𝐵,𝐶},{𝐵,𝐷},{𝐶,𝐷}}, generate Candidate
3-Itemsets (C3)
◼ Join two itemsets only if they share k−1 common items and
they differ in one item
◼ {A,B} and {A,C} → {𝐴,𝐵,𝐶}
◼ {A,B} and {𝐴,𝐷} → {𝐴,𝐵,𝐷}
◼ {A,C} and {A,D} → {𝐴,𝐶,𝐷}
◼ {B,C} and {B,D} → {B,C,D}
◼ {C,D} joins with nothing new, since no other itemset shares the prefix ‘C’.
◼ Candidate 3-Item Itemsets (C3) =
{{A,B,C},{A,B,D},{A,C,D},{B,C,D}}
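The same join-and-prune step, written as a small Python function (my own sketch; itemsets are kept as sorted tuples so the common-prefix join is explicit):

```python
from itertools import combinations

def apriori_gen(Lk, k):
    """Generate C(k+1) from Lk by the prefix join, then prune by the Apriori principle."""
    Lk = sorted(tuple(sorted(s)) for s in Lk)
    Lk_set = set(Lk)
    candidates = []
    for i in range(len(Lk)):
        for j in range(i + 1, len(Lk)):
            # Join only if the first k-1 items agree and the last item differs
            if Lk[i][:k - 1] == Lk[j][:k - 1]:
                c = tuple(sorted(set(Lk[i]) | set(Lk[j])))
                # Prune: every k-item subset of the candidate must be in Lk
                if all(tuple(sorted(s)) in Lk_set for s in combinations(c, k)):
                    candidates.append(c)
    return candidates

L2 = [{'A','B'}, {'A','C'}, {'A','D'}, {'B','C'}, {'B','D'}, {'C','D'}]
print(apriori_gen(L2, 2))
# [('A','B','C'), ('A','B','D'), ('A','C','D'), ('B','C','D')]
```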
14
The Apriori Algorithm Illustration …
Consider a transaction database with four transactions. Assume minimum support is 2. Let us illustrate the Apriori algorithm. The first step is to scan the DB and generate the frequent 1-item itemsets.

Transaction ID   Items
10               A, C, D
20               B, C, E
30               A, B, C, E
40               B, E

Obtain 1-item itemsets and count support (C1):
{A} 2, {B} 3, {C} 3, {D} 1, {E} 3

Select the frequent 1-item itemsets ‘L1’ (support count >= 2):
{A} 2, {B} 3, {C} 3, {E} 3
15
… The Apriori Algorithm Illustration

Database TDB:
Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

1st scan, C1 (itemset : sup): {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
L1: {A}:2, {B}:3, {C}:3, {E}:3

C2 (from L1): {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
2nd scan, C2 with support: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2

16
… The Apriori Algorithm Illustration

Database TDB:
Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

L1: {A}:2, {B}:3, {C}:3, {E}:3
C2: {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2

C3 (L2 join L2): {B,C,E}
3rd scan, L3: {B,C,E}:2

17
The Apriori Algorithm Illustration – Summary

Database TDB:
Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

1st scan, C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3  →  L1: {A}:2, {B}:3, {C}:3, {E}:3
C2: {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
2nd scan, C2 counts: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2  →  L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2
C3: {B,C,E}
3rd scan, C3 count: {B,C,E}:2  →  L3: {B,C,E}:2
18
The Apriori Algorithm Illustration - Termination

• L3: {B,C,E}
• When there are no more frequent itemsets to generate, the Apriori algorithm terminates
• Output: all the frequent itemsets for min-sup = 2
  1. Frequent 1-item itemsets: A, B, C, E
  2. Frequent 2-item itemsets: {A,C}, {B,C}, {B,E}, {C,E}
  3. Frequent 3-item itemsets: {B,C,E}

L1: {A}:2, {B}:3, {C}:3, {E}:3
L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2
L3: {B,C,E}:2
19
Association Rule Mining
• Strong association rules satisfy both
  ➢ minimum support
  ➢ minimum confidence
• Support count of A = n(A)
• An association rule is an implication of the form A => B, where A, B ⊆ I
• Confidence (A => B) = P(B|A) = Support-Count(A U B) / Support-Count(A)

20
Association Rule Mining from Frequent Itemsets

▪ For each frequent itemset l, generate all nonempty subsets of l.

▪ For every nonempty subset s of l, create the association rule

s => (l-s)

▪ The association rule is considered strong if it meets or


exceeds the confidence threshold, i.e., [support count (l) /
support count (s)] >= minimum confidence threshold
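A small Python sketch of this procedure (my own illustration; the support counts below are the ones used in the {I1, I2, I5} worked example that follows):

```python
from itertools import combinations

def rules_from_itemset(l, support_count, min_conf):
    """Generate all rules s => (l - s) from frequent itemset l that meet min_conf."""
    l = frozenset(l)
    rules = []
    for r in range(1, len(l)):                      # every nonempty proper subset s of l
        for s in map(frozenset, combinations(l, r)):
            conf = support_count[l] / support_count[s]
            if conf >= min_conf:
                rules.append((set(s), set(l - s), conf))
    return rules

sup = {frozenset(['I1']): 6, frozenset(['I2']): 7, frozenset(['I5']): 2,
       frozenset(['I1', 'I2']): 4, frozenset(['I1', 'I5']): 2, frozenset(['I2', 'I5']): 2,
       frozenset(['I1', 'I2', 'I5']): 2}

for a, b, c in rules_from_itemset(['I1', 'I2', 'I5'], sup, 0.7):
    print(a, '=>', b, f'(confidence {c:.2f})')
# Strong rules at min_conf = 70%: {I1,I5}=>{I2}, {I2,I5}=>{I1}, {I5}=>{I1,I2}, each 1.00
```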
Association Rule Mining Example
◼ Consider the frequent item set {K,Y}
The association rules are: - K -> Y and Y -> K
◼ Consider the frequent item set {E,K,O}. The association rules
are:
E -> O, O -> E,
K -> O, O -> K,
E -> K, K -> E,
{E, K} -> O,
{E, O} -> K,
{K, O} -> E

22
Association Rule Mining Full Example …
Describe the process of generating association rules using an
example frequent 3-item set

▪ Consider a frequent itemset {I1, I2, I5}

▪ Subsets are:- I1, I2, I5, {I1, I2}, {I1, I5}, {I2, I5}, {I1, I2, I5}

▪ The association rules, s => (l-s)

I1 => {I2, I5}, I2 => {I1, I5}, I5 => {I1, I2}

{I1, I5} => I2, {I2, I5} => I1


… Association Rule Mining Full Example
▪ Consider the frequent itemset {I1, I2, I5}. Its subsets are:
  I1, I2, I5, {I1, I2}, {I1, I5}, {I2, I5}, {I1, I2, I5}
▪ Assume the table below lists the support counts of all the above subsets of items.

L1               L2                  L3
Item   Sup       Item       Sup      Item           Sup
I1     6         I1, I2     4        I1, I2, I3     2
I2     7         I1, I3     4        I1, I2, I5     2
I3     6         I1, I5     2
I4     2         I2, I3     4
I5     2

• Confidence (A => B) = Sup-Count(A U B) / Sup-Count(A)

AR for {I1, I2, I5}     Confidence = s(A U B) / s(A)    Confidence
{I1, I2} => I5          n(I1,I2,I5) / n(I1,I2)          2 / 4 = 50%
{I1, I5} => I2          n(I1,I2,I5) / n(I1,I5)          2 / 2 = 100%
{I2, I5} => I1          n(I1,I2,I5) / n(I2,I5)          2 / 2 = 100%
I1 => {I2, I5}          n(I1,I2,I5) / n(I1)             2 / 6 = 33%
I2 => {I1, I5}          n(I1,I2,I5) / n(I2)             2 / 7 = 29%
I5 => {I1, I2}          n(I1,I2,I5) / n(I5)             2 / 2 = 100%
… Association Rule Mining Full Example

▪ Assume a minimum confidence of 70%.

Association Rule Strength of Association Confidence


{I1, I2} => I5 Weak Association 2 / 4 = 50%
{I1, I5} => I2 Strong Association 2 / 2 = 100%
{I2, I5} => I1 Strong Association 2 / 2 = 100%
I1 => {I2, I5} Weak Association 2 / 6 = 33%
I2 => {I1, I5} Weak Association 2 / 7 = 29%
I5 => {I1, I2} Strong Association 2 / 2 = 100%
Comparison of Interestingness Measures
• Support and Confidence
  – “Buy rice => buy milk [1%, 80%]” is misleading if many customers buy milk anyway
• Null transactions:
  – Transactions that do NOT contain any of the itemsets being examined are called null transactions
  – Null-invariance is crucial for correlation analysis
  – Support and confidence are not null-invariant
  – Use the following two null-invariant measures to get a better understanding of correlations:
    • Kulczynski Measure
    • IR (Imbalance Ratio)

27
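The slides do not give the formulas; as commonly defined (e.g., in Han, Kamber and Pei), Kulc(A,B) = ½(P(A|B) + P(B|A)) and IR(A,B) = |sup(A) − sup(B)| / (sup(A) + sup(B) − sup(A∪B)). A small sketch with hypothetical counts:

```python
def kulczynski(n_a, n_b, n_ab):
    """Kulc(A,B) = 0.5 * (P(B|A) + P(A|B)); null-invariant, 0.5 means no correlation."""
    return 0.5 * (n_ab / n_a + n_ab / n_b)

def imbalance_ratio(n_a, n_b, n_ab):
    """IR(A,B) = |sup(A) - sup(B)| / (sup(A) + sup(B) - sup(A and B))."""
    return abs(n_a - n_b) / (n_a + n_b - n_ab)

# Hypothetical counts: 1000 baskets with rice, 9000 with milk, 900 with both
print(kulczynski(1000, 9000, 900))       # 0.5  -> neutral, despite 90% confidence rice=>milk
print(imbalance_ratio(1000, 9000, 900))  # ~0.88 -> the two items are highly imbalanced
```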
EXERCISES
UQP 4.1.4
b) State the Apriori principle in candidate generation. Find out the
frequent item sets with minimum support of 2 using Apriori for the
following data. (8)
UQP 4.1.5 …
A database has six transactions. Let min_sup be 60% and min_conf be
80%. Find frequent itemsets using Apriori algorithm and generate
strong association rules from a three-item dataset. (8)
◼ Find frequent itemset using Apriori algorithm – 4 marks [marks can

be given for the correct steps to the solution]


◼ Generate strong association rules from the dataset – 3 marks

[marks can be given for the correct steps to the solution]


TID items bought
T1 I1, I2, I3
T2 I2, I3, I4
T3 I4, I5
T4 I1, I2, I4
T5 I1, I2, I3, I5
T6 I1, I2, I3, I4
Support counts of all candidate itemsets:

{I1} support count = 4          {I1, I2, I3} support count = 3
{I2} support count = 5          {I1, I2, I4} support count = 2
{I3} support count = 4          {I1, I2, I5} support count = 1
{I4} support count = 4          {I1, I3, I4} support count = 1
{I5} support count = 2          {I1, I3, I5} support count = 1
{I1, I2} support count = 4      {I2, I3, I4} support count = 2
{I1, I3} support count = 3      {I2, I3, I5} support count = 1
{I1, I4} support count = 2      {I1, I2, I3, I4} support count = 1
{I1, I5} support count = 1      {I1, I2, I3, I5} support count = 1
{I2, I3} support count = 4
{I2, I4} support count = 3
{I2, I5} support count = 1
{I3, I4} support count = 2
{I3, I5} support count = 1
{I4, I5} support count = 1

Let min_sup = 60%, i.e., a minimum support count of 0.6 × 6 = 3.6, so at least 4 out of 6 transactions.
Frequent 1-item itemsets: {I1}:4, {I2}:5, {I3}:4, {I4}:4
Frequent 2-item itemsets: {I1, I2}:4, {I2, I3}:4
(Also note: n(I1) = 4, n(I2) = 5, n(I3) = 4)

Note: Strong association rules require high support and high confidence. The rule must also be business relevant.

Let min_conf = 0.8. The strong association rules are:

Strong association rules for the itemset {I1, I2}:
{I1} -> {I2} confidence: 1.0 <- n(I1, I2) / n(I1)
{I2} -> {I1} confidence: 0.8 <- n(I1, I2) / n(I2)

Strong association rules for the itemset {I3, I2}:
{I2} -> {I3} confidence: 0.8 <- n(I2, I3) / n(I2)
{I3} -> {I2} confidence: 1.0 <- n(I2, I3) / n(I3)

31
UQP 4.1.6 …
A database has six transactions. Let min_sup be 33.33% and min_conf be 60%. Find frequent itemsets using the Apriori algorithm and generate strong association rules from the dataset. (8)

TID   ITEMS
T1    Cake, Bread, Jam
T2    Cake, Bread
T3    Cake, Coke, Chips
T4    Chips, Coke
T5    Chips, Jam
T6    Cake, Coke, Chips

Total transactions = 6; min_sup = 0.33; min_sup_count = 6 × 0.33 ≈ 2

Itemset {'Bread'} sup_count = 2
Itemset {'Cake'} sup_count = 4
Itemset {'Chips'} sup_count = 4
Itemset {'Coke'} sup_count = 3
Itemset {'Jam'} sup_count = 2
… UQP 4.1.6
Association rules (min_conf = 60%):

Itemset {'Cake', 'Bread'} sup_count = 2
Ok.     {'Bread'} -> {'Cake'} confidence: 1.0  <- n(Bread, Cake) / n(Bread) = 2/2 = 1
Not Ok. {'Cake'} -> {'Bread'} confidence: 0.5  <- n(Bread, Cake) / n(Cake) = 2/4 = 0.5

Itemset {'Cake', 'Chips'} sup_count = 2
Not Ok. {'Cake'} -> {'Chips'} confidence: 0.5  <- n(Cake, Chips) / n(Cake) = 2/4 = 0.5
Not Ok. {'Chips'} -> {'Cake'} confidence: 0.5  <- n(Cake, Chips) / n(Chips) = 2/4 = 0.5

Itemset {'Cake', 'Coke'} sup_count = 2
Not Ok. {'Cake'} -> {'Coke'} confidence: 0.5  <- n(Cake, Coke) / n(Cake) = 2/4 = 0.5
Ok.     {'Coke'} -> {'Cake'} confidence: 0.67 <- n(Cake, Coke) / n(Coke) = 2/3 ≈ 0.67

Itemset {'Chips', 'Coke'} sup_count = 3
Ok.     {'Chips'} -> {'Coke'} confidence: 0.75 <- n(Chips, Coke) / n(Chips) = 3/4 = 0.75
Ok.     {'Coke'} -> {'Chips'} confidence: 1.0  <- n(Chips, Coke) / n(Coke) = 3/3 = 1

33
… UQP 4.1.6
Itemset {'Cake', 'Chips', 'Coke'} sup_count= 2

Not Ok.
{'Cake'} -> {'Chips', 'Coke'} confidence: 0.5 : n(Coke,Cake,Chips) / n(Cake)
{'Chips'} -> {'Cake', 'Coke'} confidence: 0.5 : n(Coke,Cake,Chips) / n(Chips)

Ok.
{Coke} -> {Cake,Chips} confidence: 0.67 : n(Coke,Cake,Chips) / n(Coke)
{Cake,Chips} -> {Coke} confidence: 1.0 : n(Coke,Cake,Chips) / n(Cake,Chips)
{Cake,Coke} -> {Chips} confidence: 1.0 : n(Coke,Cake,Chips) / n(Cake,Coke)
{Chips,Coke} -> {Cake} confidence: 0.67 : n(Coke,Cake,Chips) / n(Chips,Coke)

34
… UQP 4.1.6
Strong Association Rules for min_conf = 60%.

Itemset {'Cake', 'Bread'} sup_count= 2


{'Bread'} -> {'Cake'} confidence: 1.0 : n(Bread, Cake) / n(Bread) = 2/2

Itemset {'Cake', 'Coke'} sup_count= 2


{'Coke'} -> {'Cake'} confidence: 0.67 : n(Cake, Coke) / n(Coke) = 2/3

Itemset {'Chips', 'Coke'} sup_count= 3


{'Chips'} -> {'Coke'} confidence: 0.75 : n(Chips, Coke) / n(Chips) = 3/4
{'Coke'} -> {'Chips'} confidence: 1.0 : n(Chips, Coke) / n(Coke) = 3/3

Itemset {'Cake', 'Chips', 'Coke'} sup_count= 2


{Coke} -> {Cake,Chips} confidence: 0.67 : n(Coke,Cake,Chips) / n(Coke)
{Cake,Chips} -> {Coke} confidence: 1.0 : n(Coke,Cake,Chips) / n(Cake,Chips)
{Cake,Coke} -> {Chips} confidence: 1.0 : n(Coke,Cake,Chips) / n(Cake,Coke)
{Chips,Coke} -> {Cake} confidence: 0.67 : n(Coke,Cake,Chips) / n(Chips,Coke)

35
Additional Exercise 4.1.7 …
A database has ten transactions. Let min-sup = 30%. Find all frequent itemsets. Let min-conf = 75%. Demonstrate association rule analysis using a frequent 3-item set from the exercise.

Hint: Before you start, for an easy workout, you may code the items uniquely:
Hill Banana    H
Litchi         L
Passion Fruit  P
Pears          R
Strawberry     S
Tomato         T
Watermelon     W

TID   List of Items
101   Litchi, Hill Banana, Strawberry
102   Litchi, Passion Fruit
103   Passion Fruit, Tomato
104   Litchi, Hill Banana, Strawberry
105   Pears, Strawberry
106   Pears
107   Pears, Passion Fruit
108   Litchi, Hill Banana, Watermelon, Strawberry
109   Watermelon, Tomato
110   Litchi, Hill Banana
… Additional Exercise 4.1.7

Frequent itemsets (min-sup = 30% of 10 transactions, i.e., support count >= 3):

{'Hill Banana'} sup_count = 4
{'Litchi'} sup_count = 5
{'Strawberry'} sup_count = 4
{'Passion Fruit'} sup_count = 3
{'Pears'} sup_count = 3
{'Hill Banana', 'Litchi'} sup_count = 4
{'Hill Banana', 'Strawberry'} sup_count = 3
{'Litchi', 'Strawberry'} sup_count = 3
{'Hill Banana', 'Litchi', 'Strawberry'} sup_count = 3

37
… Additional Exercise 4.1.7
Association rule example for the frequent 3-item set {'Hill Banana', 'Litchi', 'Strawberry'}:

Given min-conf = 75%, the strong rules are:

{'Hill Banana'} -> {'Litchi', 'Strawberry'} confidence: 3/4 = 0.75

{'Strawberry'} -> {'Hill Banana', 'Litchi'} confidence: 3/4 = 0.75

{'Hill Banana', 'Litchi'} -> {'Strawberry'} confidence: 3/4 = 0.75

{'Hill Banana', 'Strawberry'} -> {'Litchi'} confidence: 3/3 = 1.0

{'Litchi', 'Strawberry'} -> {'Hill Banana'} confidence: 3/3 = 1.0

({'Litchi'} -> {'Hill Banana', 'Strawberry'} has confidence 3/5 = 0.6 and is not strong.)

38
Additional Exercise 4.1.8
A database has four transactions. Let min-sup =60%. Find all
frequent itemsets.
Let min-conf = 80%. Demonstrate association rule analysis using
a frequent 3-item set from the exercise.
TID Date Items
T100 10/07/15 {K, A, D,B}
T200 10/07/15 {D, A, C, E, B}
T300 10/07/19 {C, A, B, E}
T400 22/10/10 {B, A, D}
Extras

40
Module 4: (Association Rule Analysis)

▪ Part - 1
▪ Association Rules-Introduction
▪ Methods to discover Association rules
▪ Apriori(Level-wise algorithm)

▪ Part - 2 (Advanced Frequent Itemset Mining)


1. Partitioning Algorithm
2. FP-tree Growth Algorithm.
Visualization:
https://athena.ecs.csus.edu/~mei/associationcw/Association.html
3. Pincer Search Algorithm,
4. Dynamic Itemset Counting Algorithm

1
KTU
4.2.1. List the modification methods to improve the efficiency of the
Apriori algorithm(3)
Describe any three methods to improve the efficiency of the
Apriori algorithm. (3)
Describe any three methods to improve the efficiency of the
Apriori algorithm. (3)

2
KTU
4.2.2 Discuss the partitioning algorithm for finding large itemset and
compare its performance with Apriori algorithm. (6)
Explain the partitioning algorithm for finding large itemset and explain
how it removes the disadvantage of Apriori algorithm. (6)
◼ Partitioning algorithm for finding large items (4)
◼ Explain how it removes the disadvantage of Apriori algorithm (2)
4.2.3 Illustrate the working of Pincer Search algorithm with an example
(6)
Illustrate the working of Pincer Search Algorithm with an example. (6)
Illustrate the working of Pincer Search Algorithm with an example. (6)
◼ Pincer Search Algorithm explanation (4)
◼ Illustration with an example (2)
4.2.4. Write about the bi-directional searching technique for pruning in
the pincer search algorithm (3)

3
KTU
4.2.5. Describe the working of the dynamic itemset counting technique
with a suitable example. Specify when to move an itemset from
dashed structures to solid structures. (8)
Describe how the dynamic itemset counting technique works with
a suitable example. Specify when to move an itemset from dashed
structures to solid structures. (6)
◼ dynamic itemset counting technique (4)
◼ explanation with a suitable example (2)
◼ Specify when to move an itemset from dashed structures to
solid structures. (2)

Note:- FP-tree Growth Algorithm – Problems included at the end

4
0. Apriori algorithm – Challenges and
Improvements

5
Apriori algorithm – Challenges and
Improvements
• Apriori - major computational challenges

– Multiple scans of the transaction database

– Huge number of candidate item sets

– High workload of support counting for candidates

• Improving Apriori: general ideas

– Reduce passes of transaction database scans

– Reduce the number of candidate item sets generated

– Improve support counting of candidates

6
Partition: Scan Database Only Twice
• Any itemset that is potentially frequent in DB must be
frequent in at least one of the partitions of DB

• Scan Database Only Twice

– Scan 1: Partition database. Find local frequent patterns

– Scan 2: Consolidate global frequent patterns

DB1 + DB2 + … + DBk = DB

If sup1(i) < σ·|DB1|, sup2(i) < σ·|DB2|, …, and supk(i) < σ·|DBk|,
then sup(i) < σ·|DB|, i.e., an itemset infrequent in every partition is infrequent in the whole DB.
Direct Hashing and Pruning to Reduce the Number of Candidates Generated
• A k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent
  – Candidates: a, b, c, d, e
  – Hash table entries (bucket count : itemsets), for example:
      35   {ab, ad, ae}
      88   {bd, be, de}
      …
      102  {yz, qs, wt}
  – Frequent 1-itemsets: a, b, d, e
  – ab is not a candidate 2-itemset if the sum of the counts of {ab, ad, ae} is below the support threshold
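A rough sketch of the bucket-counting idea (my own illustration; the hash function and bucket count are arbitrary choices, not part of the algorithm's specification):

```python
from itertools import combinations

def dhp_pair_filter(transactions, min_sup, n_buckets=7):
    """While scanning for 1-itemsets, hash every 2-itemset of each transaction into buckets."""
    bucket = [0] * n_buckets
    h = lambda pair: sum(ord(ch) for item in pair for ch in str(item)) % n_buckets
    for t in transactions:
        for pair in combinations(sorted(t), 2):
            bucket[h(pair)] += 1
    # A candidate pair can be frequent only if its whole bucket reaches min_sup
    return lambda pair: bucket[h(pair)] >= min_sup

db = [{'A', 'C', 'D'}, {'B', 'C', 'E'}, {'A', 'B', 'C', 'E'}, {'B', 'E'}]
may_be_frequent = dhp_pair_filter(db, 2)
print(may_be_frequent(('A', 'D')))   # False means ('A','D') can be pruned from C2;
                                     # True only says its bucket is large enough (collisions possible)
```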
8
Sampling for Frequent Patterns
• Select a sample of original database, mine frequent
patterns within sample using Apriori

• Scan database once to verify the frequent itemsets


found from the sample

– Only borders of closure of frequent patterns are checked.

– Example: check abcd instead of ab, ac, …, etc.

• Scan database again to find missed frequent patterns

• Sampling for Frequent Patterns improves the efficiency of


frequent pattern mining by reducing database scans and
minimizing candidate itemsets.
9
Transaction reduction

• A transaction that does not contain any frequent k-itemsets cannot contain any frequent (k+1)-itemsets, so it can be removed from later scans
• The Direct Hashing and Pruning algorithm uses this technique.

10
1. Partitioning algorithm

11
Partitioning Algorithm
for Frequent Item Set Discovery
4.2.2 Discuss the partitioning algorithm for finding large itemset and
compare its performance with Apriori algorithm. (6)
Explain the partitioning algorithm for finding large itemset and
explain how it removes the disadvantage of Apriori algorithm. (6)
◼ Partitioning algorithm for finding large items (4)
◼ Explain how it removes the disadvantage of Apriori algorithm
(2)
Partitioning Algorithm Pseudocode - Explanation
1. Initially the database D is logically partitioned into n partitions.
2. Generate local large itemsets: During the first database scan, the algorithm counts the support of itemsets in each partition, using an algorithm such as Apriori. For each partition pi, generate the local frequent itemsets of all lengths, L1i, L2i, …, Lki, i.e., those whose support is greater than or equal to the minimum local support threshold.
3. Global candidate itemsets: Select all itemsets that are large in at least one partition to form the global candidate itemsets CjG, where j = 1 to k.
4. Generate global frequent itemsets: During the second scan, count the support of each global candidate itemset CjG over the entire database. If the support is greater than or equal to the minimum global support threshold, include it in the global frequent itemsets LjG.
Note that the algorithm reduces the number of database scans to two. (A code sketch follows.)
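A compact Python sketch of the two scans (my own illustration; it reuses the apriori() function sketched in the Apriori section as the local miner, and assumes the local minimum support is the global ratio scaled to the partition size):

```python
import math

def partition_frequent(transactions, n_partitions, min_sup_ratio):
    """Two-scan partition algorithm: local mining per partition, then one global count."""
    transactions = [frozenset(t) for t in transactions]
    size = math.ceil(len(transactions) / n_partitions)
    partitions = [transactions[i:i + size] for i in range(0, len(transactions), size)]

    # Scan 1: local frequent itemsets in each partition (any miner works; apriori() from above)
    global_candidates = set()
    for part in partitions:
        local_min = max(1, math.ceil(min_sup_ratio * len(part)))
        global_candidates |= apriori(part, local_min)

    # Scan 2: count every global candidate over the entire database
    global_min = min_sup_ratio * len(transactions)
    return {c for c in global_candidates
            if sum(1 for t in transactions if c <= t) >= global_min}

# The six-transaction illustration below: 3 partitions of 2 transactions, global min support 4
db6 = [{'I1','I2','I3'}, {'I2','I3','I4'}, {'I4','I5'},
       {'I1','I2','I4'}, {'I1','I2','I3','I5'}, {'I1','I2','I3','I4'}]
print(sorted(map(sorted, partition_frequent(db6, 3, 4/6))))
# Expected: {I1}, {I2}, {I3}, {I4}, {I1,I2}, {I2,I3}
```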

13
Partitioning Algorithm Illustration

◼ The table below illustrates the use of the partition algorithm. The database D is partitioned into three partitions, each containing 2 transactions.
◼ Set the minimum local support threshold as 2. Generate the local large 1-item itemsets L1 and 2-item itemsets L2.
◼ (Note: as the partition size was set too small, no frequent 3-item itemsets were generated.)

TID   Items              Partition   Local Large Itemset L1    Local Frequent Itemset L2
T1    I1, I2, I3         1           I2:2, I3:2                {I2,I3}:2
T2    I2, I3, I4         1
T3    I4, I5             2           I4:2
T4    I1, I2, I4         2
T5    I1, I2, I3, I5     3           I1:2, I2:2, I3:2          {I1,I2}:2, {I1,I3}:2, {I2,I3}:2
T6    I1, I2, I3, I4     3
14
Partitioning Algorithm Illustration
◼ Set the minimum global support threshold as 4.
◼ Count the support of the itemsets in the global candidate itemsets C1 and C2. Generate the frequent 1-item itemset L1 and 2-item itemset L2.
◼ The frequent itemsets selected: {I1}, {I2}, {I3}, {I4}, {I1,I2}, {I2,I3}
◼ (Note: as the partition size was set too small, no frequent 3-item itemsets were generated.)

TID   Items
T1    I1, I2, I3
T2    I2, I3, I4
T3    I4, I5
T4    I1, I2, I4
T5    I1, I2, I3, I5
T6    I1, I2, I3, I4

Global Candidate Itemsets (C1, C2)   Global Support   Global Frequent Itemsets (Min-Support = 4)
{I1}                                 4                {I1}
{I2}                                 5                {I2}
{I3}                                 4                {I3}
{I4}                                 4                {I4}
{I1, I2}                             4                {I1, I2}
{I1, I3}                             3                Not Frequent
{I2, I3}                             4                {I2, I3}
15
Advantages of the partitioning method
1. Large Itemset Property: A large itemset must be large in at least one partition. So the focus shifts from analyzing the entire database to identifying large itemsets within individual partitions, which is less costly.
2. Limited Memory: The memory required by a partition is relatively small. The count of itemsets to be processed per partition is smaller compared to the entire database, which further reduces the memory needs.
3. Parallel and Distributed Processing: Each partition can be processed independently, allowing for parallelization. In a distributed computing environment, each partition can be assigned to a separate processing unit, enabling efficient utilization of CPU and processing time.
4. Incremental Generation of Association Rules: When new data is added to the database, only the partitions containing the new entries need to be processed to update the association rules. This avoids recomputing the association rules from scratch for the entire database, saving computational resources and time.
16
2. FP-tree Growth Algorithm

17
FP-tree Growth Algorithm
for Frequent Item Set Discovery
◼ Solved Problem: Description
  ◼ https://www.geeksforgeeks.org/frequent-pattern-growth-algorithm/
◼ Solved Problem: Video
  ◼ https://www.youtube.com/watch?v=7oGz4PCp9jI
    Frequent Pattern (FP) Growth Algorithm Association Rule Mining Solved Example by Mahesh Huddar
◼ Solved Problem: Animation Software
  ◼ Provide the transaction list. The software will demonstrate the solution, step by step
  ◼ https://athena.ecs.csus.edu/~mei/associationcw/FpGrowth.html
Drawbacks of Apriori Algorithm

◼ The Apriori Algorithm has two costly drawbacks


◼ Breadth-first (i.e., level-wise) search
◼ Candidate generation and counting
◼ At each step, candidate sets have to be built. Often
generates a huge number of candidates
◼ To count the frequency of candidate sets, the algorithm
repeatedly scans the database.
FP-tree Growth Algorithm
◼ The Frequent Pattern Growth Algorithm features the
following. These features reduce the bottlenecks of Apriori.
◼ Depth-first search
◼ Avoiding explicit candidate generation and counting.
Only two database scans needed. This is particularly
advantageous for large datasets.
◼ The FP-Growth algorithm utilizes a compact data structure called an ‘FP-tree’ to store transactions in a compressed format, aiding in efficient frequent itemset discovery.
FP-TREE

Txn   Itemsets
T1    E, K, M, N, O, Y
T2    D, E, K, N, O, Y
T3    A, E, K, M
T4    C, K, M, U, Y
T5    C, E, I, K, O

◼ The FP-Tree is a compact, prefix-based representation of transaction data.
◼ The root is an empty node.
◼ Each node in the FP-Tree corresponds to an item in a transaction and maintains a frequency count.
◼ Traversing from the root to a leaf node forms an itemset by concatenating the items along that path.
◼ In FP-Growth, transactions are stored as paths in the FP-Tree, where each node represents an item and its count.
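A bare-bones node structure and insertion routine illustrating this idea (a sketch under my own simplifications: no header table or node-links, and items assumed to be pre-sorted by descending frequency):

```python
class FPNode:
    def __init__(self, item=None, parent=None):
        self.item = item          # None for the root
        self.count = 0
        self.parent = parent
        self.children = {}        # item -> FPNode

def insert_transaction(root, ordered_items):
    """Insert one ordered transaction as a path, sharing and counting common prefixes."""
    node = root
    for item in ordered_items:
        child = node.children.get(item)
        if child is None:                     # no existing prefix path: create a new node
            child = FPNode(item, parent=node)
            node.children[item] = child
        child.count += 1                      # shared prefix: just increment the count
        node = child

# The five ordered transactions from the worked example that follows
root = FPNode()
for txn in [['K','E','M','O','Y'], ['K','E','O','Y'], ['K','E','M'],
            ['K','M','Y'], ['K','E','O']]:
    insert_transaction(root, txn)
print(root.children['K'].count)                 # 5
print(root.children['K'].children['E'].count)   # 4
```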

21
The Algorithm
I. Scan the DB once, find the frequent 1-items. Sort the frequent items in descending order of frequency.
II. Scan the DB again and construct the FP-tree.

Example
Consider the transaction database with 5 transactions composed of 11 items. Let the min. support be 3.

TXN   ITEMSET
T1    E, K, M, N, O, Y
T2    D, E, K, N, O, Y
T3    A, E, K, M
T4    C, K, M, U, Y
T5    C, E, I, K, O

STEP - I
Scan the database and count each item:
A:1, C:2, D:1, E:4, I:1, K:5, M:3, N:2, O:3, U:1, Y:3

There are 5 frequent 1-items. In sorted order of frequency (L1):
L = {K:5, E:4, M:3, O:3, Y:3}
22
STEP – I Continued

Frequency order: K:5, E:4, M:3, O:3, Y:3

TXN   ITEMSET              ORDERED ITEMSET
T1    E, K, M, N, O, Y     K, E, M, O, Y
T2    D, E, K, N, O, Y     K, E, O, Y
T3    A, E, K, M           K, E, M
T4    C, K, M, U, Y        K, M, Y
T5    C, E, I, K, O        K, E, O

• Order the items within each transaction by their frequency.
• Drop the infrequent items from the itemset.
• For example, T1: {E, K, M, N, O, Y} -> {K, E, M, O, Y}
  • The items are rearranged in the order of their frequency. Item ‘N’ is discarded, as it is not frequent (the support count of N is 2, whereas 3 is set as the min. support).
23
Step –II (a) Insert ordered item set {K,E,M,O,Y} in FP Tree
TXN ITEMSETS ORDERED ITEMSETS
T1 E, K, M, N, O, Y K, E, M, O, Y
T2 D, E, K, N, O, Y K, E, O, Y
T3 A, E, K, M K, E, M
T4 C, K, M, U, Y K, M, Y
T5 C, E, I, K, O K, E, O

• Root node is designated as ‘NULL’


• The Ordered Item Sets are
inserted one after the other
• E.g. {K E M O Y}. Insert each
item in the itemset as per order.
• Mark the inserted nodes with ’1’
– the number of times the node
is visited (support)

24
Step –II (b) Insert ordered item set {K,E,O,Y} in FP Tree

TXN ITEMSETS ORDERED ITEMSETS


T1 E, K, M, N, O, Y K, E, M, O, Y
T2 D, E, K, N, O, Y K, E, O, Y

• The elements K and E are


already present in order in the
Tree.
• So, increase their support count
by 1
• K=2
• E=2

25
Step –II (b) Insert ordered item set {K,E,O,Y} in FP Tree

TXN ITEMSETS ORDERED ITEMSETS


T1 E, K, M, N, O, Y K, E, M, O, Y
T2 D, E, K, N, O, Y K, E, O, Y

• On inserting O:
• There is no direct link between E & O.
• So, create a new node for ‘O’.
• Link it with E.
• Assign a support count of 1 to O
• On inserting Y
• Create a new node for the item ‘Y’.
• Link it with O.
• Assign a support count of 1 to Y

26
Step –II (c) Insert ordered item set {K,E,M} in FP Tree
TXN   ITEMSETS             ORDERED ITEMSETS
T1 E, K, M, N, O, Y K, E, M, O, Y
T2 D, E, K, N, O, Y K, E, O, Y
T3 A, E, K, M K, E, M
T4 C, K, M, U, Y K, M, Y
T5 C, E, I, K, O K, E, O

• The elements K, E, and M


are already present, in
order, in the Tree.
• So, increase their support
count by 1
• K = 2+1 = 3
• E = 2+1 = 3
• M = 1+1 = 2
27
Step –II (d) Insert ordered item set {K,M,Y} in FP Tree

TXN   ITEMSETS             ORDERED ITEMSETS
T1    E, K, M, N, O, Y     K, E, M, O, Y
T2    D, E, K, N, O, Y     K, E, O, Y
T3    A, E, K, M           K, E, M
T4    C, K, M, U, Y        K, M, Y
T5    C, E, I, K, O        K, E, O

• K is already present in order in the Tree. So, increase its support count by 1 (K = 4)
• On inserting M
  • There is no direct link between K and M.
  • So, create a new node for ‘M’. Link it with ‘K’
  • Assign M = 1
• On inserting Y
  • Create a new node ‘Y’. Link it with M.
  • Assign Y = 1
28
Step –II (e) Insert {K,E,O} in FP Tree
TXN   ITEMSETS             ORDERED ITEMSETS
T1 E, K, M, N, O, Y K, E, M, O, Y
T2 D, E, K, N, O, Y K, E, O, Y
T3 A, E, K, M K, E, M
T4 C, K, M, U, Y K, M, Y
T5 C, E, I, K, O K, E, O

• The elements K, E, O
are already present
in order in the Tree.
• So, increase their
support count by 1
• K=5
• E=4
• O=2

29
Conditional Pattern Bases
◼ Make a list of the frequent 1-items in the ascending order of their
frequencies
◼ For each of these items, find out all the paths leading from the
root. These paths are called the conditional pattern base.
◼ Example
◼ ‘Y’ can be reached from the root using the path (K->E->M-

>O). So K,E,M,O is a conditional pattern base for item ‘Y’.


◼ ‘Y’ can be reached using the paths KEMO, KEO, and KM. The 3

paths together form the conditional pattern base for item ‘Y

Y 3
O 3
M 3
E 4
K 5
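The prefix paths above can be read straight off the tree sketched earlier. A small helper (my own illustration; it walks the toy FPNode tree built in the previous code sketch instead of using header-table links):

```python
def conditional_pattern_base(root, item):
    """Collect (prefix path, count) for every occurrence of `item` in the FP-tree."""
    bases = []
    def walk(node, path):
        if node.item == item and path:
            bases.append((list(path), node.count))
        for child in node.children.values():
            walk(child, path + ([node.item] if node.item else []))
    walk(root, [])
    return bases

print(conditional_pattern_base(root, 'Y'))
# [(['K', 'E', 'M', 'O'], 1), (['K', 'E', 'O'], 1), (['K', 'M'], 1)]
```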
30
Conditional Frequent Pattern Tree
For each item (Y,O,M,E,K), build conditional frequent pattern tree:-
◼ Consider one item at time (e.g., Y)

◼ Take the set of items that are common to all the paths in its

conditional pattern base.


◼ Calculate the support count of common items by summing

the support counts in all the paths.


◼ E.g., Consider ‘Y’. The conditional pattern base of ‘Y’ is

{KEMO:1}, {KEO:1}, {KM:1}.


◼ Observe that K is common to all paths. Sum the support count
‘K’ in all the paths. So the support count of K = 1+1+1 = 3

31
Conditional Frequent Pattern Tree
• Consider ‘O’.
• The conditional pattern base of ‘O’ is {KEM:1}, {KE:2}.
• Observe that KE is common to all paths.
• Sum the support count of ‘KE’ in all paths. So the support count
of KE = 1+2 = 3

32
Conditional Frequent Pattern Tree
• Consider ‘M’.
• The conditional pattern base of ‘M’ is {KE:2}, {K:1}.
• Observe that K is common to all paths.
• Sum the support count of ‘K’ in all the paths. So the support
count of K = 2 + 1 = 3

33
Frequent Itemsets
From the Conditional Frequent Pattern tree, the Frequent
Itemsets are generated by pairing itemsets in the ‘Conditional
Frequent Pattern Tree’ with the corresponding frequent 1-item.
◼ For example, consider the first itemset {K} of the conditional
frequent pattern tree. {K} will be paired with ‘Y’.
◼ The frequent pattern that emerges is {K,Y}, with frequency 3

FREQUENT 1-ITEM   CONDITIONAL FREQUENT PATTERN TREE   FREQUENT ITEMSETS
Y                 {K} : 3                             {K,Y} : 3
O                 {K,E} : 3                           [ {K,O}:3, {E,O}:3, {E,K,O}:3 ]
M                 {K} : 3                             {K,M} : 3
E                 {K} : 4                             {E,K} : 4
K
34
Frequent Itemsets
◼ Consider the second itemset {K,E} of the conditional frequent
pattern tree. {K,E} will be paired with ‘O’.
◼ The frequent patterns that emerge are {K,O}, {E,O}, {E,K,O},
each with frequency 3.
◼ Similarly pair {K} with ‘M’. Then pair {K} with ‘E’.
◼ Final Result: ‘frequent itemsets’ listed in the last column.
FREQUENT 1-ITEM   CONDITIONAL FREQUENT PATTERN TREE   FREQUENT ITEMSETS
Y                 {K} : 3                             {K,Y} : 3
O                 {K,E} : 3                           [ {K,O}:3, {E,O}:3, {E,K,O}:3 ]
M                 {K} : 3                             {K,M} : 3
E                 {K} : 4                             {E,K} : 4
K
35
4.2.6 KTU - May 2024: FP Growth
◼ Mention the advantages of FP Growth algorithm. Find
out the frequent item sets using FP Growth for the
following data. (8)


36
4.2.7 KTU - June 2023: FP Growth
◼ A database has six transactions. Let min_sup be 3. Find frequent itemsets using the FP growth algorithm.

TID   ITEMS
T1    {f, a, c, d, m, p}
T2    {a, b, c, f, m}
T3    {b, f, j}
T4    {b, c, k, p}
T5    {a, f, c, e, p, m}
T6    {f, a, c, d, m, p}

◼ Answer: https://web.iitd.ac.in/~bspanda/MTL782FPTREE.pdf (FP Growth, IIT Delhi)

37
Ex. 4.2.8 FP (From Apriori Section)
A database has four transactions:
A, C, D
B, C, E
A, B, C, E
B, E
Let min-support = 50%. Find the frequent itemsets using FP Growth.
Use: https://athena.ecs.csus.edu/~mei/associationcw/FpGrowth.html

Ex. 4.2.9 FP (From Apriori Section)
A database has four transactions. Let min-sup = 60%. Find all frequent itemsets using FP Growth.
TID    Items
T100   {K, A, D, B}
T200   {D, A, C, E, B}
T300   {C, A, B, E}
T400   {B, A, D}
Use: https://athena.ecs.csus.edu/~mei/associationcw/FpGrowth.html
4.2.10. FP and Association Rules

A database has six transactions. Let min_sup be 60% and min_conf be 80%. Find frequent itemsets using the FP Tree growth algorithm and generate strong association rules. (8)

TID   Items bought
T1    I1, I2, I3
T2    I2, I3, I4
T3    I4, I5
T4    I1, I2, I4
T5    I1, I2, I3, I5
T6    I1, I2, I3, I4

Note: This was asked in 2023 October – for a solution using Apriori.
3. Pincer Search Algorithm

40
Pincer Search Algorithm
for Frequent Item Set Discovery
◼ University Questions
  ◼ Illustrate the working of Pincer Search Algorithm with an example. (6)
    ◼ Pincer Search Algorithm explanation (4)
    ◼ Explanation with an example (2)
  ◼ Write about the bi-directional searching technique for pruning in the pincer search algorithm (3)
◼ Answer: Pincer Search Algorithm video and writeup
  ◼ https://www.youtube.com/watch?v=Rb1gDmeBPxA
    DM2 CL7 - PINCER SEARCH algorithm in data mining with an example (in Malayalam)
  ◼ Corresponding writeup:
    ◼ https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1000342
    ◼ Or https://vikramuniv.ac.in/files/wp-content/uploads/MCA-IV_DataMining18_Pincer_Search_Algorithm_Keerti_Dixit.pdf
Pincer Search and Maximal frequent set
◼ A maximal frequent set is a frequent itemset none of whose proper
supersets is frequent; all of its proper subsets are, of course,
themselves frequent.
◼ Maximum Frequent Set (MFS) is a collection of all the
maximal frequent itemsets. It's like a master list containing all
the biggest groups of items that are frequent.
◼ The MFS acts as a boundary between the groups of items that are
popular and the ones that aren't. Everything in the MFS is
frequent, and everything outside of it is not.
◼ Instead of trying to find all the frequent itemsets, we focus on finding
the MFS only. From the MFS, we generate all other frequent itemsets.
◼ Once MFS is ready, we can get the count of all frequent items by
scanning the transaction database just once. No need for multiple
database scan for item count as is done in Apriori

42
Pincer Search Method
◼ The key concept behind Pincer-Search is the bi-directional
exploration of the search space.
◼ The top-down search starts with the largest possible
itemsets and gradually prunes them down
◼ The bottom-up search begins with the smallest
itemsets and expands them.
◼ The information gleaned from one direction is shared with the
other, allowing for computationally effective pruning of
candidate itemsets.
◼ Pincer search uses the Apriori algorithm for bottom-up search
approach to identify frequent itemsets of size-1, size-2 and so
on in sequence.

43
Pincer Search Method
◼ Pincer search maintains two special data structures
◼ MFS – Maximum Frequent Set
◼ MFCS, the Maximum Frequent Candidate Set.
◼ The MFCS efficiently identifies maximal frequent item sets of
large length. It starts with a single set of all the items, from
which subsets of frequent item sets are generated by a top-
down method.
◼ The Maximum Frequent set (MFS) comprises all maximally
frequent itemsets. It starts as a null set and is constructed
bottom-up.
◼ When the algorithm terminates, MFS = MFCS.
◼ We generate the subsets of all the sets in MFS - they all will be
frequent.

44
Pincer Search – A simple Example …
◼ Consider the following Problem
◼ Items 1, 2, 3, 4, 5.
◼ Transactions: {1, 3}, {1, 2}, {1, 2, 3, 4}
◼ Minimum support count: 1 (an itemset is frequent if it appears in at least one transaction)
◼ Initially, MFCS = {1, 2, 3, 4, 5}; MFS ={}.
◼ In the first pass, {1, 2, 3, 4, 5} is the candidate for the top-
down search.
◼ All 1-item itemsets are candidates for the bottom-up search

45
… Pincer Search – A simple Example
◼ Bottom-up:
◼ First pass: consider all 1-itemsets. Do support counting. We
observe that itemset {5} is not frequent.
◼ Second pass:
◼ Eliminate the supersets of {5} from further processing -
the itemsets {1, 5}, {2, 5}, {3, 5}, {4, 5} are discarded.
◼ The itemsets {1, 2}, {1, 3}, {1, 4}, {2, 3}, {2, 4}, and
{3, 4} are candidates for further processing.

46
… Pincer Search – A simple Example
◼ Top down:
◼ From the bottom-up search, itemset {5} was found infrequent.
◼ So, eliminate the supersets of {5}: MFCS = {1, 2, 3, 4}.
◼ Do support counting. {1, 2, 3, 4} is discovered to be frequent.
◼ Bottom up:
◼ From the top-down search, we found that {1, 2, 3, 4} is
frequent
◼ So all the subsets of {1, 2, 3, 4} must be frequent and they
need not be examined further.
◼ MFS is a frequent set where its proper subsets are frequent
and its proper supersets are infrequent. So, MFS = {1, 2, 3, 4}

47
… Pincer Search – A simple Example
◼ Top down:
◼ MFCS = {1, 2, 3, 4}. This is discovered to be frequent.
◼ Bottom up:
◼ MFS = {1, 2, 3, 4}
◼ MFCS = MFS. The program terminates.
◼ There is one Frequent 4 item itemset, MFS = {1, 2, 3, 4}.
◼ The subsets of this MFS are also frequent. This means:-
◼ Frequent 4 itemsets: {1, 2, 3, 4}.
◼ Frequent 3 itemsets: {1, 2, 3}, {1, 2, 4}, {1, 3, 4}, {2, 3, 4}.
◼ Frequent 2 itemsets:{1, 2}, {1, 3}, {1, 4}, {2, 3}, {2, 4}, {3, 4}
◼ Frequent 1 itemsets:{1}, {2}, {3}, {4}
48
4. Dynamic Itemset Counting
algorithm

49
Dynamic Itemset Counting
for Frequent Item Set Discovery
◼ Describe the working of dynamic itemset counting technique
with suitable example. Specify when to move an itemset from
dashed structures to solid structures. (8)
◼ dynamic itemset counting technique (4)

◼ explanation with suitable example (2)

◼ explanation to move an itemset from dashed structures to

solid structures (2)


◼ Dynamic Itemset Counting Algorithm
◼ https://www.youtube.com/watch?v=SLhLJZK6KaE
◼ Dynamic Itemset Counting Solved Example Apriori
Algorithm Association Rule Mining by Mahesh Huddar
DYNAMIC ITEMSET COUNTING – OVERVIEW

◼ Dynamic itemset counting offers a more efficient and scalable


approach to discovering frequent itemsets in transactional data,
enabling insights into item associations and patterns with
reduced computational overhead and memory usage.
◼ Unlike traditional static counting methods, which involve
recalculating support counts for the entire dataset, dynamic
itemset counting updates support counts incrementally as each
transaction is processed.
◼ This incremental approach reduces computational overhead
and memory usage, making it particularly suitable for large
transaction databases.

51
DYNAMIC ITEMSET COUNTING – FEATURES
◼ Transaction-based Updates: Support counts of itemsets are updated
dynamically as each transaction is processed. Instead of re-evaluating all
transactions, only the relevant ones are considered for updating support
counts.

◼ Efficient Data Structures: DIC relies on efficient data structures such as


hash tables, trees, or other indexing mechanisms to maintain and update
support counts. These structures facilitate fast retrieval and updating of
support counts during transaction processing.

◼ Memory Efficiency: Dynamic itemset counting optimizes memory usage by


updating the support counts incrementally and avoiding redundant storage of
transactional data. It focuses on maintaining only the necessary data
structures for efficient counting, leading to reduced memory requirements.

◼ Scalability: Dynamic itemset counting is scalable and can effectively handle


large transaction databases. Its incremental approach allows it to process
transactions efficiently, making it suitable for scenarios where traditional
methods may become impractical due to memory or processing constraints.

52
DIC PROCEDURE …
An itemset lattice contains all the possible itemsets for a
transaction database. Each itemset in the lattice points to all of its
supersets. When represented graphically, an itemset lattice can
help us to understand the concepts behind the DIC algorithm.
Itemsets are marked in four different ways as they are counted:
◼ Dashed circle: suspected infrequent itemset - an itemset we
are still counting that is below min_sup
◼ Dashed box: suspected frequent itemset - an itemset we are
still counting that exceeds min_sup
◼ Solid box: confirmed frequent itemset - an itemset we have
finished counting and exceeds the support threshold min_sup
◼ Solid circle: confirmed infrequent itemset - we have finished
counting, and it is below min_sup

53
DIC PROCEDURE …
1. Initialization:
◼ Start with an empty itemset marked with a solid square.
◼ Mark all 1-itemsets with dashed circles.

2. Iterative Counting:
While there are still dashed itemsets:
◼ Read M transactions at a time from the dataset.
◼ Increase the count of itemsets marked with dashes if they
are present in the transaction.
◼ Update markings based on count thresholds. (PTO)

54
… DIC PROCEDURE
… Step-3 continued
◼ On updating the markings:
If a dashed circle's count exceeds the minimum support
threshold (min_sup),
◼ Turn it into a dashed square.
◼ Check the immediate supersets: If any immediate superset
has all its subsets as squares (either solid or dashed), mark
that superset as a dashed circle and update its count.
◼ Completion:
Once a dashed itemset has been counted through all the
transactions in the database, make it solid and stop counting
it.

55
DIC – ILLUSTRATION …

◼ Consider the following example with four transactions. We are interested in the items A, B, and C.
◼ Rewrite the transactions to show the presence of an item with ‘1’ and its absence with ‘0’.

Txn   Itemsets             A   B   C
T1    A, B                 1   1   0
T2    A                    1   0   0
T3    B, C                 0   1   1
T4    (none of A, B, C)    0   0   0

https://www.youtube.com/watch?v=SLhLJZK6KaE
https://www2.cs.uregina.ca/~dbd/cs831/notes/itemsets/DIC.html

56
… DIC – ILLUSTRATION …

M = 2; Min-Sup = 25% of 4 transactions = 1

◼ The transactions are read M (= 2) at a time: first {1,1,0} and {1,0,0}, then {0,1,1} and {0,0,0}
◼ Initial counters: A = 0, B = 0, C = 0
◼ The empty itemset is marked with a solid box
◼ All 1-itemsets are marked with dashed circles

57

After M transactions are read:
Counters: A=2, B=1, C=0, AB=0. The empty itemset is marked in a solid box. A and B are >= min_sup: mark them with dashed squares. Mark ‘AB’, their immediate superset, with a dashed circle. Mark ‘C’ with a dashed circle.

After 2M transactions are read:
Counters: A=2, B=2, C=1, AB=0, AC=0, BC=0. By now, the entire database has been read once. A, B, and C are changed to solid squares as their support >= min_sup. Counters are added for AC and BC.
58
After 3M transactions are read:
Counters: A=2, B=2, C=1, AB=1, AC=0, BC=0. AB has been counted through all the transactions; change it from a dashed circle to a solid square. BC, the immediate superset of the frequent itemsets B and C, continues to be counted (still dashed).

After 4M transactions are read:
Counters: A=2, B=2, C=1, AB=1, AC=0, BC=1. AC and BC have now been counted through the entire database; change their marks from dashed to solid. Do not count ABC, as AC is not frequent.

59
… DIC – ILLUSTRATION - Completion
Completion: As all the dashed itemsets have been counted through
all the transactions in the database, stop counting.
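A very simplified Python sketch of the counting-and-marking loop (my own illustration; it assumes M divides the number of transactions, uses only immediate-superset generation, and omits the hash-tree and counter optimisations of the real algorithm):

```python
from itertools import combinations

def dic(transactions, min_sup, M):
    """Toy DIC: states are DC/DB (dashed circle/box) and SC/SB (solid circle/box)."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    state = {frozenset([i]): 'DC' for i in items}      # start: 1-itemsets as dashed circles
    count = {s: 0 for s in state}
    seen = {s: 0 for s in state}                        # transactions counted so far per itemset
    pos = 0
    while any(st in ('DC', 'DB') for st in state.values()):
        for t in transactions[pos:pos + M]:             # read M transactions
            for s in state:
                if state[s] in ('DC', 'DB') and seen[s] < n:
                    if s <= t:
                        count[s] += 1
                    seen[s] += 1
        pos = (pos + M) % n
        for s in list(state):                           # update the markings
            if state[s] == 'DC' and count[s] >= min_sup:
                state[s] = 'DB'                         # suspected frequent
            if state[s] in ('DC', 'DB') and seen[s] >= n:
                state[s] = 'SB' if count[s] >= min_sup else 'SC'   # finished counting
        for s in list(state):                           # start counting eligible supersets
            if state[s] in ('DB', 'SB'):
                for i in items:
                    sup = s | {i}
                    if sup not in state and all(
                            frozenset(c) in state and state[frozenset(c)] in ('DB', 'SB')
                            for c in combinations(sup, len(sup) - 1)):
                        state[sup], count[sup], seen[sup] = 'DC', 0, 0
    return {tuple(sorted(s)): count[s] for s in state if state[s] == 'SB'}

db = [{'A', 'B'}, {'A'}, {'B', 'C'}, set()]             # the four-transaction example above
print(dic(db, min_sup=1, M=2))
# {('A',): 2, ('B',): 2, ('C',): 1, ('A', 'B'): 1, ('B', 'C'): 1}
```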

60
Extras

61
FP Growth Python Code

from mlxtend.frequent_patterns import fpgrowth
from mlxtend.frequent_patterns import association_rules
import pandas as pd

# Given transactions
data = [['A', 'B'],
        ['A', 'B', 'C'],
        ['A', 'B', 'C', 'E'],
        ['A', 'B', 'C', 'E'],
        ['A', 'B', 'D'],
        ['A', 'C'],
        ['A', 'C'],
        ['B', 'C'],
        ['B', 'D']]

# Create a list of all unique items
items = sorted(set(item for transaction in data for item in transaction))

# Convert transactions into a DataFrame with binary (one-hot) representation
df = pd.DataFrame([{item: (item in transaction) for item in items} for transaction in data])

# Apply the FP-Growth algorithm
frequent_itemsets = fpgrowth(df, min_support=2/9, use_colnames=True)

# Display frequent itemsets
print("Frequent Itemsets:")
print(frequent_itemsets)

Output of print(frequent_itemsets):

    support    itemsets
0   0.777778   (B)
1   0.777778   (A)
2   0.666667   (C)
3   0.222222   (E)
4   0.222222   (D)
5   0.555556   (A, B)
6   0.555556   (A, C)
7   0.444444   (C, B)
8   0.333333   (A, C, B)
9   0.222222   (E, C)
10  0.222222   (A, E)
11  0.222222   (E, B)
12  0.222222   (A, E, C)
13  0.222222   (E, C, B)
14  0.222222   (A, E, B)
15  0.222222   (A, E, C, B)
16  0.222222   (D, B)
62
Dynamic Itemset Counting to Reduce the Number of Scans

◼ Once both A and D are determined frequent, the counting of AD begins
◼ Once all length-2 subsets of BCD are determined frequent, the counting of BCD begins

(Figure: the itemset lattice over {A, B, C, D}: {} at the bottom; A, B, C, D; the 2-itemsets AB, AC, BC, AD, BD, CD; the 3-itemsets ABC, ABD, ACD, BCD; and ABCD at the top. A companion timeline contrasts Apriori, which counts one level of the lattice per full pass over the transactions, with DIC, which starts counting 1-, 2- and 3-itemsets during the same pass as soon as their subsets qualify.)

S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. SIGMOD’97
63
Partitioning Algorithm
Pseudocode

64
