Mod 4 Part1 - Merged
▪ Part - 1
▪ Association Rules-Introduction
▪ Methods to discover Association rules
▪ Apriori (Level-wise algorithm)
Visualization:
https://athena.ecs.csus.edu/~mei/associationcw/Association.html
▪ Part - 2 (Advanced Frequent Itemset Mining Algorithms)
▪ Partition Algorithm
▪ FP-tree Growth Algorithm.
▪ Pincer Search Algorithm,
▪ Dynamic Itemset Counting Algorithm
KTU
4.1.1. Discuss the significance of association rule mining in market
basket analysis. (3)
4.1.2. Define support, confidence, and frequent itemset, in association
rule mining context. (3)
What Is Frequent Pattern Analysis?
◼ Frequent pattern: a pattern (a set of items, subsequences,
substructures, etc.) that occurs frequently in a data set
◼ Motivation: Finding inherent regularities in data
◼ What products were often purchased together?
◼ What are the subsequent purchases after buying a PC?
◼ What kinds of DNA are sensitive to this new drug?
◼ Can we automatically classify web documents?
◼ Applications
◼ market basket analysis, cross-marketing, catalog design, sales
campaign analysis, web log (click stream) analysis, DNA sequence
analysis, plagiarism check
Market Basket
TID List of Items
101 Litchi, Hill Banana, Strawberry
102 Litchi, Passion Fruit
103 Passion Fruit, Tomato
104 Litchi, Hill Banana, Strawberry
105 Pears, Strawberry
106 Pears
107 Pears, Passion Fruit
108 Litchi, Hill Banana, Watermelon, Strawberry
109 Watermelon, Tomato
110 Litchi, Hill Banana
Significance of Association Rule Mining in Market
Basket Analysis
Association Rule Mining – Support
• I = {i1, i2, ..., in}: the set of all items
• Transaction t: a set of items such that t ⊆ I

Transaction ID   Items
10               A, C, D
20               B, C, E
30               A, B, C, E
Association Rule Mining – Confidence
Transactions (items coded as integers):
{3, 5, 8}, {2, 6, 8}, {1, 4, 7, 10}, {3, 8, 10}, {2, 5, 8},
{1, 5, 6}, {4, 5, 6, 8}, {2, 3, 4}, {1, 5, 7, 8}, {3, 8, 9, 10}

◼ Support count of A = n(A)
◼ An itemset A is frequent if A's support count is >= a min-sup threshold
◼ An Association Rule is an implication of the form A ⇒ B, where A, B ⊆ I
◼ Confidence (A ⇒ B) = P(B|A) = Support-Count(A ∪ B) / Support-Count(A)
◼ Example: Confidence ({8} ⇒ {5}) = Sup Count({8} ∪ {5}) / Sup Count({8}) = 4 / 7
Support and Confidence Example
Using the same transactions, what is the confidence of the association rule {5} ⇒ {8}?
• Sup Count({5} ∪ {8}) = 4
• Sup Count({5}) = 5
• Confidence ({5} ⇒ {8}) = 4 / 5 = 0.8
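To make the counting above concrete, here is a minimal Python sketch (the helper names sup_count and confidence are purely illustrative) that reproduces both confidence values from the ten example transactions:

transactions = [{3, 5, 8}, {2, 6, 8}, {1, 4, 7, 10}, {3, 8, 10}, {2, 5, 8},
                {1, 5, 6}, {4, 5, 6, 8}, {2, 3, 4}, {1, 5, 7, 8}, {3, 8, 9, 10}]

def sup_count(itemset):
    # number of transactions containing every item of the itemset
    return sum(1 for t in transactions if itemset <= t)

def confidence(A, B):
    # confidence(A => B) = sup_count(A U B) / sup_count(A)
    return sup_count(A | B) / sup_count(A)

print(confidence({8}, {5}))   # 4/7 ≈ 0.57
print(confidence({5}, {8}))   # 4/5 = 0.80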
The Apriori Algorithm
1. Initial step: Set k = 1. Scan the transaction database once to obtain the frequent 1-item itemset list L1.
2. Self-join Lk: Join two itemsets if they have k-1 common items and differ in exactly one item.
3. Prune: Select the joined itemsets for which all immediate subsets are frequent. Add them to Ck+1, the candidate itemset list.
4. Support count: Scan the transaction database and count the support for each itemset in Ck+1.
5. Select the frequent itemsets: Choose itemsets from Ck+1 whose support count meets or exceeds the minimum support threshold. These form the next frequent itemset list Lk+1.
6. Iterate: Set k = k + 1 and repeat steps 2 to 5 as long as Lk is non-empty. (A minimal code sketch follows.)
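The six steps can be condensed into a short, self-contained Python sketch (function and variable names are illustrative, not from any library); it performs the self-join, prune, and support-counting steps described above:

from itertools import combinations

def apriori(transactions, min_sup_count):
    # returns {frequent itemset (frozenset): support count}
    transactions = [frozenset(t) for t in transactions]
    counts = {}
    for t in transactions:                        # L1: frequent 1-itemsets
        for i in t:
            counts[frozenset([i])] = counts.get(frozenset([i]), 0) + 1
    Lk = {s: c for s, c in counts.items() if c >= min_sup_count}
    frequent = dict(Lk)
    k = 1
    while Lk:
        candidates = set()
        for a, b in combinations(Lk, 2):          # self-join: share k-1 items
            u = a | b
            if len(u) == k + 1:
                # prune: keep u only if every k-subset of u is frequent
                if all(frozenset(s) in Lk for s in combinations(u, k)):
                    candidates.add(u)
        cand_counts = {c: 0 for c in candidates}  # support counting (one DB scan)
        for t in transactions:
            for c in candidates:
                if c <= t:
                    cand_counts[c] += 1
        Lk = {c: n for c, n in cand_counts.items() if n >= min_sup_count}
        frequent.update(Lk)
        k += 1
    return frequent

# e.g., the four-transaction database used in the illustration below, min-sup = 2
print(apriori([['A', 'C', 'D'], ['B', 'C', 'E'], ['A', 'B', 'C', 'E'], ['B', 'E']], 2))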
Example of Generation of
Candidate 3-Item Itemsets from L2
◼ Given Frequent 2-Item Itemsets 𝐿2 =
{{𝐴,𝐵},{𝐴,𝐶},{𝐴,𝐷},{𝐵,𝐶},{𝐵,𝐷},{𝐶,𝐷}}, generate Candidate
3-Itemsets (C3)
◼ Join two itemsets only if they share k−1 common items and
they differ in one item
◼ {A,B} and {A,C} → {𝐴,𝐵,𝐶}
◼ {A,B} and {𝐴,𝐷} → {𝐴,𝐵,𝐷}
◼ {A,C} and {A,D} → {𝐴,𝐶,𝐷}
◼ {B,C} and {B,D} → {B,C,D}
◼ {C,D} produces no new candidate, since no other itemset in L2 shares the prefix ‘C’.
◼ Candidate 3-Item Itemsets (C3) =
{{A,B,C},{A,B,D},{A,C,D},{B,C,D}}
The Apriori Algorithm Illustration …
Consider a transaction database with four transactions. Assume minimum support is 2. Let us illustrate the Apriori algorithm. The first step is to scan the DB and generate the frequent 1-item itemset L1.

Transaction ID   Items
10               A, C, D
20               B, C, E
30               A, B, C, E
40               B, E

C2 (2nd scan)   sup        L2        sup
{A, B}          1          {A, C}    2
{A, C}          2          {B, C}    2
{A, E}          1          {B, E}    3
{B, C}          2          {C, E}    2
{B, E}          3
{C, E}          2
The Apriori Algorithm Illustration – Summary

Database TDB
Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

C1 (1st scan)   sup        L1       sup
{A}             2          {A}      2
{B}             3          {B}      3
{C}             3          {C}      3
{D}             1          {E}      3
{E}             3

C2 (2nd scan)   sup        L2        sup
{A, B}          1          {A, C}    2
{A, C}          2          {B, C}    2
{A, E}          1          {B, E}    3
{B, C}          2          {C, E}    2
{B, E}          3
{C, E}          2

C3 (3rd scan)   sup        L3           sup
{B, C, E}       2          {B, C, E}    2
The Apriori Algorithm Illustration - Termination
➢ The iteration stops when no new frequent itemsets are generated.
➢ Association rules are then derived from the frequent itemsets using the minimum support and minimum confidence thresholds:
• A ⇒ B, where A, B ⊆ I
• Confidence (A ⇒ B) = P(B|A)
Association Rule Mining from Frequent Itemsets
For every frequent itemset l, generate rules of the form s ⇒ (l - s) for each non-empty proper subset s of l, and keep the rules whose confidence meets the minimum confidence threshold.
Association Rule Mining Full Example …
Describe the process of generating association rules using an
example frequent 3-item set
▪ Consider the frequent 3-item set {I1, I2, I5}. Its subsets are: {I1}, {I2}, {I5}, {I1, I2}, {I1, I5}, {I2, I5}, {I1, I2, I5}
▪ Assume the tables below list the support counts of the relevant itemsets
• Confidence (A ⇒ B) = Sup-Count(A ∪ B) / Sup-Count(A)

L1              L2                L3
Item   Sup      Item      Sup     Item          Sup
I1     6        I1, I2    4       I1, I2, I3    2
I2     7        I1, I3    4       I1, I2, I5    2
I3     6        I1, I5    2
I4     2        I2, I3    4
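A small sketch of the s ⇒ (l - s) procedure for l = {I1, I2, I5}. The support counts are taken from the tables above, except sup({I5}) and sup({I2, I5}), which are not shown there and are assumed to be 2 here so the example runs end to end; min_conf = 0.75 is likewise just an example threshold.

from itertools import combinations

# support counts; sup({I5}) and sup({I2,I5}) are assumed values (= 2)
sup = {frozenset(k): v for k, v in [
    (('I1',), 6), (('I2',), 7), (('I5',), 2),
    (('I1', 'I2'), 4), (('I1', 'I5'), 2), (('I2', 'I5'), 2),
    (('I1', 'I2', 'I5'), 2)]}

def rules_from(l, min_conf):
    # emit every rule s => (l - s) and flag those meeting min_conf
    l = frozenset(l)
    for r in range(1, len(l)):
        for s in combinations(l, r):
            s = frozenset(s)
            conf = sup[l] / sup[s]
            status = 'STRONG' if conf >= min_conf else 'weak'
            print(set(s), '=>', set(l - s), 'confidence = %.2f' % conf, status)

rules_from({'I1', 'I2', 'I5'}, min_conf=0.75)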
EXERCISES
UQP 4.1.4
b) State the Apriori principle in candidate generation. Find out the
frequent item sets with minimum support of 2 using Apriori for the
following data. (8)
UQP 4.1.5 …
A database has six transactions. Let min_sup be 60% and min_conf be
80%. Find frequent itemsets using Apriori algorithm and generate
strong association rules from a three-item dataset. (8)
◼ Find frequent itemsets using Apriori algorithm – 4 marks
UQP 4.1.6 …
A database has six transactions. Let min_sup be 33.33% and
min_conf be 60%. Find frequent itemset using Apriori algorithm
and generate strong association rules from the dataset (8)
TID   ITEMS
T1    Cake, Bread, Jam
T2    Cake, Bread
T3    Cake, Coke, Chips
T4    Chips, Coke
T5    Chips, Jam
T6    Cake, Coke, Chips

Total Transactions = 6
min_sup = 0.33; min_sup_count = 6 × 0.33 ≈ 2

Frequent 1-itemsets:
Itemset {'Bread'}   sup_count = 2
Itemset {'Cake'}    sup_count = 4
Itemset {'Chips'}   sup_count = 4
Itemset {'Coke'}    sup_count = 3
Itemset {'Jam'}     sup_count = 2
… UQP 4.1.6
Association Rules (min_conf = 60%):
… UQP 4.1.6
Itemset {'Cake', 'Chips', 'Coke'} sup_count = 2

Not OK (confidence < 0.6):
{'Cake'} -> {'Chips', 'Coke'} confidence: 0.5 : n(Coke,Cake,Chips) / n(Cake)
{'Chips'} -> {'Cake', 'Coke'} confidence: 0.5 : n(Coke,Cake,Chips) / n(Chips)

OK (confidence >= 0.6):
{Coke} -> {Cake,Chips} confidence: 0.67 : n(Coke,Cake,Chips) / n(Coke)
{Cake,Chips} -> {Coke} confidence: 1.0 : n(Coke,Cake,Chips) / n(Cake,Chips)
{Cake,Coke} -> {Chips} confidence: 1.0 : n(Coke,Cake,Chips) / n(Cake,Coke)
{Chips,Coke} -> {Cake} confidence: 0.67 : n(Coke,Cake,Chips) / n(Chips,Coke)
… UQP 4.1.6
Strong Association Rules for min_conf = 60%: {Coke} -> {Cake, Chips}, {Cake, Chips} -> {Coke}, {Cake, Coke} -> {Chips}, {Chips, Coke} -> {Cake}.
Additional Exercise 4.1.7 …
A database has ten transactions. Let min-sup = 30%. Find all frequent itemsets. Let min-conf = 75%. Demonstrate association rule analysis using a frequent 3-item set from the exercise.

Hint:- Before you start, for an easy workout, you may code the items uniquely:
Hill Banana     H
Litchi          L
Passion Fruit   P
Pears           R
Strawberry      S
Tomato          T
Watermelon      W

TID   List of Items
101   Litchi, Hill Banana, Strawberry
102   Litchi, Passion Fruit
103   Passion Fruit, Tomato
104   Litchi, Hill Banana, Strawberry
105   Pears, Strawberry
106   Pears
107   Pears, Passion Fruit
108   Litchi, Hill Banana, Watermelon, Strawberry
109   Watermelon, Tomato
110   Litchi, Hill Banana
… Additional Exercise 4.1.7
Association Rule Example for the frequent 3-item set {'Hill Banana', 'Litchi', 'Strawberry'}:-
Additional Exercise 4.1.8
A database has four transactions. Let min-sup =60%. Find all
frequent itemsets.
Let min-conf = 80%. Demonstrate association rule analysis using
a frequent 3-item set from the exercise.
TID    Date       Items
T100   10/07/15   {K, A, D, B}
T200   10/07/15   {D, A, C, E, B}
T300   10/07/19   {C, A, B, E}
T400   22/10/10   {B, A, D}
Module 4: (Association Rule Analysis)
▪ Part - 1
▪ Association Rules-Introduction
▪ Methods to discover Association rules
▪ Apriori (Level-wise algorithm)
KTU
4.2.1. List the modification methods to improve the efficiency of the
Apriori algorithm(3)
Describe any three methods to improve the efficiency of the
Apriori algorithm. (3)
KTU
4.2.2 Discuss the partitioning algorithm for finding large itemset and
compare its performance with Apriori algorithm. (6)
Explain the partitioning algorithm for finding large itemset and explain
how it removes the disadvantage of Apriori algorithm. (6)
◼ Partitioning algorithm for finding large items (4)
◼ Explain how it removes the disadvantage of Apriori algorithm (2)
4.2.3 Illustrate the working of Pincer Search algorithm with an example
(6)
Illustrate the working of Pincer Search Algorithm with an example. (6)
◼ Pincer Search Algorithm explanation (4)
◼ Illustration with an example (2)
4.2.4. Write about the bi-directional searching technique for pruning in
the pincer search algorithm (3)
KTU
4.2.5. Describe the working of the dynamic itemset counting technique
with a suitable example. Specify when to move an itemset from
dashed structures to solid structures. (8)
Describe how the dynamic itemset counting technique works with
a suitable example. Specify when to move an itemset from dashed
structures to solid structures. (6)
◼ dynamic itemset counting technique (4)
◼ explanation with a suitable example (2)
◼ Specify when to move an itemset from dashed structures to
solid structures. (2)
0. Apriori algorithm – Challenges and Improvements
• Apriori – major computational challenges (addressed by the partitioning, hash-based, and sampling techniques below)
Partition: Scan Database Only Twice
• Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
• …

Hash-based technique: reduce the number of candidates
– Example: frequent 1-itemsets are a, b, d, e
– ab is not a candidate 2-itemset if the sum of the counts of its hash bucket {ab, ad, ae} is below the support threshold
Sampling for Frequent Patterns
• Select a sample of the original database; mine frequent patterns within the sample using Apriori
1. Partitioning algorithm
Partitioning Algorithm
for Frequent Item Set Discovery
4.2.2 Discuss the partitioning algorithm for finding large itemset and
compare its performance with Apriori algorithm. (6)
Explain the partitioning algorithm for finding large itemset and
explain how it removes the disadvantage of Apriori algorithm. (6)
◼ Partitioning algorithm for finding large items (4)
◼ Explain how it removes the disadvantage of Apriori algorithm
(2)
Partitioning Algorithm Pseudocode - Explanation
1. Initially the database D is logically partitioned into n partitions.
2. Generate local large itemsets: During the first database scan, the algorithm counts the support of itemsets in each partition, using an algorithm such as Apriori. For each partition pi, generate local frequent itemsets of all lengths, L1i, L2i, …, Lki, whose support is greater than or equal to the minimum local support threshold.
3. Global candidate itemsets: Select all itemsets that are large in at least one partition to generate the global candidate itemsets CjG, where j = 1 to k.
4. Generate global frequent itemsets: Count the support of each global candidate itemset CjG over the entire database during a second scan. If the support is greater than or equal to the minimum global support threshold, include it in the global frequent itemsets LjG.
Note that the algorithm reduces the number of database scans to two. (A minimal code sketch follows.)
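A minimal Python sketch of the two phases described above. It reuses the apriori() function sketched in the Apriori section as the local miner; the function name, the slicing-based partitioning, and the threshold scaling are illustrative choices, not a prescribed implementation.

def partition_algorithm(transactions, min_sup_ratio, n_parts):
    # Phase 1: mine each partition locally; Phase 2: one global counting scan
    transactions = [frozenset(t) for t in transactions]
    size = max(1, len(transactions) // n_parts)
    parts = [transactions[i:i + size] for i in range(0, len(transactions), size)]

    global_candidates = set()
    for p in parts:                                       # first scan, partition by partition
        local_min = max(1, round(min_sup_ratio * len(p)))
        global_candidates |= set(apriori(p, local_min))   # apriori() from the sketch above

    total_min = min_sup_ratio * len(transactions)
    result = {}
    for c in global_candidates:                           # second scan over the whole database
        n = sum(1 for t in transactions if c <= t)
        if n >= total_min:
            result[c] = n
    return result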
Partitioning Algorithm Illustration
◼ The Table below illustrates the use of the partition algorithm. The database D is partitioned into three partitions, each containing 2 transactions.
◼ Set the minimum local support threshold as 2. Generate the local large 1-item itemsets L1 and 2-item itemsets L2.
◼ (Note:- As the partition size was set too small, frequent 3-item itemsets were not generated)

TID   Items            Partition   Local Large Itemset L1   Local Frequent Itemset L2
T1    I1, I2, I3       1           I2:2, I3:2               {I2, I3}:2
T2    I2, I3, I4       1
T3    I4, I5           2           I4:2
T4    I1, I2, I4       2
T5    I1, I2, I3, I5   3           I1:2, I2:2, I3:2         {I1, I2}:2, {I1, I3}:2, {I2, I3}:2
T6    I1, I2, I3, I4   3
Partitioning Algorithm Illustration
◼ Set the minimum global support threshold as 4.
◼ Count the support of the itemsets in the global candidate itemsets C1 and C2 over the whole database (the same six transactions as above). Generate the frequent 1-item itemset L1 and 2-item itemset L2.
◼ The frequent itemsets selected: {I1}, {I2}, {I3}, {I4}, {I1,I2}, {I2,I3}
◼ (Note:- As the partition size was set too small, frequent 3-item itemsets were not generated)

Global Candidate Itemset (C1, C2)   Global Support   Global Frequent Itemsets (Min-Support = 4)
{I1}                                4                {I1}
{I2}                                5                {I2}
{I3}                                4                {I3}
{I4}                                4                {I4}
{I1, I2}                            4                {I1, I2}
{I1, I3}                            3                Not Frequent
{I2, I3}                            4                {I2, I3}
Advantages of partitioning method
1. Large Itemset Property: A large itemset must be large in at least
one partition. So the focus shifts from analyzing the entire database to
identifying large itemsets within individual partitions, which is less costly.
2. Limited Memory: The memory required by a partition is relatively small. The count of itemsets to be processed per partition is smaller compared to the entire database - this further reduces the memory needs.
3. Parallel and Distributed Processing: Each partition can be
processed independently, allowing for parallelization. In a distributed
computing environment, each partition can be assigned to a separate
processing unit, enabling efficient utilization of CPU and processing time.
4. Incremental Generation of Association Rules: When new data is added to the database, only the partitions containing the new entries need to be processed to update the association rules. This approach avoids recomputing the association rules from scratch for the entire database, saving computational resources and time.
2. FP-tree Growth Algorithm
FP-tree Growth Algorithm
for Frequent Item Set Discovery
◼ Solved Problem: Description
◼ https://www.geeksforgeeks.org/frequent-pattern-growth-
algorithm/
◼ Solved Problem: Video
◼ (https://www.youtube.com/watch?v=7oGz4PCp9jI)
◼ Frequent Pattern (FP) Growth Algorithm Association Rule Mining
Solved Example by Mahesh Huddar
◼ Solved Problem: Animation Software
◼ Provide the transaction List. The software will demonstrate the
solution, step by step
◼ https://athena.ecs.csus.edu/~mei/associationcw/FpGrowth.html
Drawbacks of Apriori Algorithm: repeated database scans and costly candidate generation – the motivation for FP-Growth.
The Algorithm
I. Scan DB once, find the frequent 1-items. Sort frequent items in descending order of frequency.
II. Scan DB again, construct the FP-tree.

Example
Consider the transaction database with 5 transactions composed of 11 items. Let the min. support be 3.

TXN   ITEMSET
T1    E, K, M, N, O, Y
T2    D, E, K, N, O, Y
T3    A, E, K, M
T4    C, K, M, U, Y
T5    C, E, I, K, O

STEP - I
Scan the database and count each item: A:1, C:2, D:1, E:4, I:1, K:5, M:3, N:2, O:3, U:1, Y:3.
There are 5 frequent 1-items: L1 = {K:5, E:4, M:3, O:3, Y:3}, in the sorted order of frequency.

L1
Item   Freq.
K      5
E      4
M      3
O      3
Y      3
STEP – I Continued
TXN   ITEMSET             ORDERED ITEMSET
T1    E, K, M, N, O, Y    K, E, M, O, Y
T2    D, E, K, N, O, Y    K, E, O, Y
T3    A, E, K, M          K, E, M
T4    C, K, M, U, Y       K, M, Y
T5    C, E, I, K, O       K, E, O

L1: K 5, E 4, M 3, O 3, Y 3
Step –II (a) Insert ordered item set {K,E,M,O,Y} in FP Tree
• The first transaction creates a single path from the root: K:1 → E:1 → M:1 → O:1 → Y:1 (ordered itemsets as in the table above).
Step –II (b) Insert ordered item set {K,E,O,Y} in FP Tree
• On inserting O:
• There is no direct link between E & O.
• So, create a new node for ‘O’.
• Link it with E.
• Assign a support count of 1 to O
• On inserting Y
• Create a new node for the item ‘Y’.
• Link it with O.
• Assign a support count of 1 to Y
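A minimal sketch of this insertion step (the Node class and insert_transaction() are illustrative names, not library code): a shared prefix only has its counts incremented, while a new suffix creates new linked nodes.

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def insert_transaction(root, ordered_items):
    node = root
    for item in ordered_items:
        if item not in node.children:      # no existing branch: create a new node
            node.children[item] = Node(item, node)
        node = node.children[item]
        node.count += 1                    # shared prefix: just bump the count

root = Node(None, None)
for t in [['K', 'E', 'M', 'O', 'Y'], ['K', 'E', 'O', 'Y'], ['K', 'E', 'M'],
          ['K', 'M', 'Y'], ['K', 'E', 'O']]:
    insert_transaction(root, t)
print(root.children['K'].count)            # 5, as in the illustration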
Step –II (c) Insert ordered item set {K,E,M} in FP Tree
• K, E and M are already present in order in the tree, so increase their support counts by 1 (K = 3, E = 3, M = 2).
• The remaining ordered itemsets {K, M, Y} (T4) and {K, E, O} (T5) are inserted in the same way.
• After all five transactions are inserted, the counts along the K–E path are K = 5, E = 4, and the O node below E has count 2.
Conditional Pattern Bases
◼ Make a list of the frequent 1-items in the ascending order of their frequencies: Y:3, O:3, M:3, E:4, K:5
◼ For each of these items, find out all the paths leading from the root. These paths are called the conditional pattern base.
◼ Example
◼ ‘Y’ can be reached from the root using the paths (K→E→M→O→Y), (K→E→O→Y) and (K→M→Y). So the conditional pattern base of ‘Y’ is {KEMO:1}, {KEO:1}, {KM:1}.
Conditional Frequent Pattern Tree
For each item (Y, O, M, E, K), build the conditional frequent pattern tree:-
◼ Consider one item at a time (e.g., Y)
◼ Take the set of items that are common to all the paths in its conditional pattern base, and keep them if their accumulated support count meets the minimum support.
Conditional Frequent Pattern Tree
• Consider ‘O’.
• The conditional pattern base of ‘O’ is {KEM:1}, {KE:2}.
• Observe that KE is common to all paths.
• Sum the support count of ‘KE’ in all paths. So the support count
of KE = 1+2 = 3
Conditional Frequent Pattern Tree
• Consider ‘M’.
• The conditional pattern base of ‘M’ is {KE:2}, {K:1}.
• Observe that K is common to all paths.
• Sum the support count of ‘K’ in all the paths. So the support
count of K = 2 + 1 = 3
Frequent Itemsets
From the Conditional Frequent Pattern tree, the Frequent Itemsets are generated by pairing itemsets in the ‘Conditional Frequent Pattern Tree’ with the corresponding frequent 1-item.
◼ For example, consider the first itemset {K} of the conditional frequent pattern tree. {K} will be paired with ‘Y’.
◼ The frequent pattern that emerges is {K,Y}, with frequency 3

(FREQUENT 1-ITEM)   CONDITIONAL FREQUENT PATTERN TREE   FREQUENT ITEMSETS
Y                   {K} : 3                             {K,Y} : 3
O                   {K,E} : 3                           [ {K,O}:3, {E,O}:3, {E,K,O}:3 ]
M                   {K} : 3                             {K,M} : 3
E                   {K} : 4                             {E,K} : 4
K                   –                                   –
Frequent Itemsets
◼ Consider the second itemset {K,E} of the conditional frequent pattern tree. {K,E} will be paired with ‘O’.
◼ The frequent patterns that emerge are {K,O}, {E,O}, {E,K,O}, each with frequency 3.
◼ Similarly pair {K} with ‘M’. Then pair {K} with ‘E’.
◼ Final Result: the ‘frequent itemsets’ listed in the last column of the table above.
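A small sketch of this pairing step (the cond_fp_tree dictionary simply transcribes the table above; names are illustrative): every non-empty combination of the conditional-tree items is paired with its suffix item and inherits the tree's count.

from itertools import combinations

cond_fp_tree = {'Y': ({'K'}, 3), 'O': ({'K', 'E'}, 3), 'M': ({'K'}, 3), 'E': ({'K'}, 4)}

for suffix, (items, count) in cond_fp_tree.items():
    for r in range(1, len(items) + 1):
        for combo in combinations(sorted(items), r):
            print(set(combo) | {suffix}, ':', count)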
4.2.6 KTU - May 2024: FP Growth
◼ Mention the advantages of FP Growth algorithm. Find
out the frequent item sets using FP Growth for the
following data. (8)
4.2.7 KTU - June 2023: FP Growth
◼ A database has six transactions. Let min_sup be 3. Find frequent itemsets using FP growth algorithm.

TID   ITEMS
T1    {f, a, c, d, m, p}
T2    {a, b, c, f, m}
T3    {b, f, j}
T4    {b, c, k, p}
T5    {a, f, c, e, p, m}
T6    {f, a, c, d, m, p}

◼ Answer:- https://web.iitd.ac.in/~bspanda/MTL782FPTREE.pdf (FP Growth, IIT Delhi)
Ex.4.2.8 FP (From Apriori Section)
A database has four transactions:
A, C, D
B, C, E
A, B, C, E
B, E
Let min-support = 50%. Find the frequent itemsets using FP Growth.
Use: https://athena.ecs.csus.edu/~mei/associationcw/FpGrowth.html

Ex.4.2.9 FP (From Apriori Section)
A database has four transactions. Let min-sup = 60%. Find all frequent itemsets using FP Growth.
TID    Items
T100   {K, A, D, B}
T200   {D, A, C, E, B}
T300   {C, A, B, E}
T400   {B, A, D}
Use: https://athena.ecs.csus.edu/~mei/associationcw/FpGrowth.html
4.2.10.FP and Association Rules
Note: This was asked in 2023 October – for solution using Apriori
3. Pincer Search Algorithm
Pincer Search Algorithm
for Frequent Item Set Discovery
◼ University Questions
◼ Illustrate the working of Pincer Search Algorithm with an example. (6)
◼ Pincer Search Algorithm explanation (4)
◼ Explanation with an example (2)
◼ Write about the bi-directional searching technique for pruning in pincer search
algorithm (3)
◼ Answer: Pincer Search Algorithm Video and Writeup
◼ https://www.youtube.com/watch?v=Rb1gDmeBPxA – DM2 CL7 - PINCER SEARCH algorithm in data mining with Example (in Malayalam)
◼ Corresponding Writeup:-
◼ https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1000342
◼ Or https://vikramuniv.ac.in/files/wp-content/uploads/MCA-IV_DataMining18_Pincer_Search_Algorithm_Keerti_Dixit.pdf
Pincer Search and Maximal frequent set
◼ A maximal frequent set is a frequent itemset all of whose proper supersets are infrequent (its proper subsets are, by the Apriori property, automatically frequent).
◼ Maximum Frequent Set (MFS) is a collection of all the
maximal frequent itemsets. It's like a master list containing all
the biggest groups of items that are frequent.
◼ The MFS acts as a boundary between the groups of items that are
popular and the ones that aren't. Everything in the MFS is
frequent, and everything outside of it is not.
◼ Instead of trying to find all the frequent itemsets, we focus on finding
the MFS only. From the MFS, we generate all other frequent itemsets.
◼ Once MFS is ready, we can get the count of all frequent items by
scanning the transaction database just once. No need for multiple
database scan for item count as is done in Apriori
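A short sketch of that last point: every frequent itemset can be enumerated as a non-empty subset of the maximal sets in the MFS (a single database scan would then attach the support counts). The MFS value used here is the one derived in the worked example later in this section.

from itertools import combinations

MFS = [frozenset({1, 2, 3, 4})]                  # maximal frequent itemsets

frequent = set()
for m in MFS:
    for r in range(1, len(m) + 1):
        frequent.update(frozenset(c) for c in combinations(m, r))

print(sorted(frequent, key=lambda s: (len(s), sorted(s))))   # 15 itemsets for one 4-item maximal set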
Pincer Search Method
◼ The key concept behind Pincer-Search is the bi-directional
exploration of the search space.
◼ The top-down search starts with the largest possible
itemsets and gradually prunes them down
◼ The bottom-up search begins with the smallest
itemsets and expands them.
◼ The information gleaned from one direction is shared with the
other, allowing for computationally effective pruning of
candidate itemsets.
◼ Pincer search uses the Apriori algorithm for bottom-up search
approach to identify frequent itemsets of size-1, size-2 and so
on in sequence.
Pincer Search Method
◼ Pincer search maintains two special data structures
◼ MFS – Maximum Frequent Set
◼ MFCS, the Maximum Frequent Candidate Set.
◼ The MFCS efficiently identifies maximal frequent item sets of
large length. It starts with a single set of all the items, from
which subsets of frequent item sets are generated by a top-
down method.
◼ The Maximum Frequent set (MFS) comprises all maximally
frequent itemsets. It starts as a null set and is constructed
bottom-up.
◼ When the algorithm terminates, MFS = MFCS.
◼ We generate the subsets of all the sets in MFS - they all will be
frequent.
Pincer Search – A simple Example …
◼ Consider the following Problem
◼ Items 1, 2, 3, 4, 5.
◼ Transactions: {1, 3}, {1, 2}, {1, 2, 3, 4}
◼ Minimum Support: 0.5
◼ Initially, MFCS = {1, 2, 3, 4, 5}; MFS ={}.
◼ In the first pass, {1, 2, 3, 4, 5} is the candidate for the top-
down search.
◼ All 1-item itemsets are candidates for the bottom-up search
… Pincer Search – A simple Example
◼ Bottom-up:
◼ First pass: consider all 1-itemsets. Do support counting. We
observe that itemset {5} is not frequent.
◼ Second pass:
◼ Eliminate the supersets of {5} from further processing -
the itemsets {1, 5}, {2, 5}, {3, 5}, {4, 5} are discarded.
◼ The itemsets {1, 2}, {1, 3}, {1, 4}, {2, 3}, {2, 4}, and
{3, 4} are candidates for further processing.
… Pincer Search – A simple Example
◼ Top down:
◼ From the bottom-up search, itemset {5} was found infrequent.
◼ So, eliminate the supersets of {5}: MFCS = {1, 2, 3, 4}.
◼ Do support counting. {1, 2, 3, 4} is discovered to be frequent.
◼ Bottom up:
◼ From the top-down search, we found that {1, 2, 3, 4} is
frequent
◼ So all the subsets of {1, 2, 3, 4} must be frequent and they
need not be examined further.
◼ {1, 2, 3, 4} is frequent and all of its proper supersets are infrequent, so it is a maximal frequent set. Hence MFS = {{1, 2, 3, 4}}.
… Pincer Search – A simple Example
◼ Top down:
◼ MFCS = {1, 2, 3, 4}. This is discovered to be frequent.
◼ Bottom up:
◼ MFS = {1, 2, 3, 4}
◼ MFCS = MFS. The program terminates.
◼ There is one Frequent 4 item itemset, MFS = {1, 2, 3, 4}.
◼ The subsets of this MFS are also frequent. This means:-
◼ Frequent 4 itemsets: {1, 2, 3, 4}.
◼ Frequent 3 itemsets: {1, 2, 3}, {1, 2, 4}, {1, 3, 4}, {2, 3, 4}.
◼ Frequent 2 itemsets:{1, 2}, {1, 3}, {1, 4}, {2, 3}, {2, 4}, {3, 4}
◼ Frequent 1 itemsets:{1}, {2}, {3}, {4}
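A minimal sketch of the top-down MFCS maintenance used in this example (prune_mfcs is an illustrative name): whenever the bottom-up pass finds an itemset infrequent, every MFCS member containing it is split so that no remaining candidate contains it, and only maximal sets are kept.

def prune_mfcs(mfcs, infrequent):
    S = frozenset(infrequent)
    new = set()
    for m in mfcs:
        if not S <= m:
            new.add(m)                 # untouched: it does not contain S
            continue
        for s in S:
            new.add(m - {s})           # drop one item of S at a time
    return {m for m in new if not any(m < other for other in new)}   # keep only maximal sets

mfcs = {frozenset({1, 2, 3, 4, 5})}
mfcs = prune_mfcs(mfcs, {5})           # the bottom-up pass found {5} infrequent
print(mfcs)                            # {frozenset({1, 2, 3, 4})}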
4. Dynamic Itemset Counting Algorithm
Dynamic Itemset Counting
for Frequent Item Set Discovery
◼ Describe the working of dynamic itemset counting technique
with suitable example. Specify when to move an itemset from
dashed structures to solid structures. (8)
◼ dynamic itemset counting technique (4)
DYNAMIC ITEMSET COUNTING – FEATURES
◼ Transaction-based Updates: Support counts of itemsets are updated
dynamically as each transaction is processed. Instead of re-evaluating all
transactions, only the relevant ones are considered for updating support
counts.
DIC PROCEDURE …
An itemset lattice contains all the possible itemsets for a
transaction database. Each itemset in the lattice points to all of its
supersets. When represented graphically, an itemset lattice can
help us to understand the concepts behind the DIC algorithm.
Itemsets are marked in four different ways as they are counted:
◼ Dashed circle: suspected infrequent itemset - an itemset we
are still counting that is below min_sup
◼ Dashed box: suspected frequent itemset - an itemset we are
still counting that exceeds min_sup
◼ Solid box: confirmed frequent itemset - an itemset we have
finished counting and exceeds the support threshold min_sup
◼ Solid circle: confirmed infrequent itemset - we have finished
counting, and it is below min_sup
DIC PROCEDURE …
1. Initialization:
◼ Start with an empty itemset marked with a solid square.
◼ Mark all 1-itemsets with dashed circles.
2. Iterative Counting:
While there are still dashed itemsets:
◼ Read M transactions at a time from the dataset.
◼ Increase the count of itemsets marked with dashes if they
are present in the transaction.
◼ Update markings based on count thresholds (continued below).
… DIC PROCEDURE (Step 2 continued)
◼ On updating the markings:
If a dashed circle's count exceeds the minimum support
threshold (min_sup),
◼ Turn it into a dashed square.
◼ Check the immediate supersets: If any immediate superset
has all its subsets as squares (either solid or dashed), mark
that superset as a dashed circle and update its count.
◼ Completion:
Once a dashed itemset has been counted through all the transactions in the database, make it solid and stop counting it.
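The whole procedure can be sketched in Python as follows. This is a simplified sketch under stated assumptions (M is taken to divide the number of transactions, and plain dictionaries stand in for the dashed/solid markings); all names are illustrative.

def dic(transactions, min_sup_count, M):
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    # markings: 'dashed-circle', 'dashed-square', 'solid-circle', 'solid-square'
    state = {frozenset([i]): 'dashed-circle' for i in items}
    count = {s: 0 for s in state}
    seen = {s: 0 for s in state}            # transactions examined since counting began
    pos = 0
    while any(st.startswith('dashed') for st in state.values()):
        block = [transactions[(pos + j) % n] for j in range(M)]   # read M transactions at a time
        pos = (pos + M) % n
        for t in block:
            t = frozenset(t)
            for s, st in state.items():
                if st.startswith('dashed'):
                    seen[s] += 1
                    if s <= t:
                        count[s] += 1
        # dashed circle -> dashed square once the counter reaches min_sup_count
        for s, st in state.items():
            if st == 'dashed-circle' and count[s] >= min_sup_count:
                state[s] = 'dashed-square'
        # start counting an immediate superset once all of its subsets are squares
        for a in [s for s, st in state.items() if st.endswith('square')]:
            for i in items:
                cand = a | {i}
                if i in a or cand in state:
                    continue
                if all(state.get(cand - {x}, '').endswith('square') for x in cand):
                    state[cand], count[cand], seen[cand] = 'dashed-circle', 0, 0
        # a dashed itemset counted over one full pass becomes solid
        for s, st in list(state.items()):
            if st.startswith('dashed') and seen[s] >= n:
                state[s] = 'solid-square' if count[s] >= min_sup_count else 'solid-circle'
    return {s: count[s] for s, st in state.items() if st == 'solid-square'}

# The four-transaction illustration that follows (T4 contains none of A, B, C): M = 2, min-sup count = 1
print(dic([{'A', 'B'}, {'A'}, {'B', 'C'}, set()], min_sup_count=1, M=2))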
DIC – ILLUSTRATION …

Txn   Itemsets              A   B   C
T1    A, B                  1   1   0
T2    A                     1   0   0
T3    B, C                  0   1   1
T4    (none of A, B, C)     0   0   0

https://www.youtube.com/watch?v=SLhLJZK6KaE
https://www2.cs.uregina.ca/~dbd/cs831/notes/itemsets/DIC.html
… DIC – ILLUSTRATION …
◼ M = 2 (read two transactions at a time)
◼ Counters: A=0, B=0, C=0
◼ Empty itemset marked with a solid box
◼ All 1-itemsets marked with dashed circles
◼ Min-Sup = 25%; min-sup count = 4 × 0.25 = 1
After M transactions are read:
Counters: A=2, B=1, C=0, AB=0. The empty itemset is marked as a solid box. A and B are >= min_sup: mark them with dashed squares. Mark ‘AB’, their immediate superset, with a dashed circle. Mark ‘C’ with a dashed circle.

After 2M transactions are read:
Counters: A=2, B=2, C=1, AB=0, AC=0, BC=0. By now, the entire database has been read once. A, B, and C are changed to solid squares as their support >= min_sup. Counters are added for AC and BC.

After 3M transactions are read:
Counters: A=2, B=2, C=1, AB=1, AC=0, BC=0. AB has been counted through all transactions – change it from a dashed circle to a solid square. BC, marked earlier with a dashed circle as an immediate superset of the frequent itemsets B and C, is still being counted.

After 4M transactions are read:
Counters: A=2, B=2, C=1, AB=1, AC=0, BC=1. The entire database has been read once more. AC and BC have been counted through all the transactions; change their marks from dashed to solid. ABC is never counted, as AC is not frequent.
… DIC – ILLUSTRATION - Completion
Completion: Once every dashed itemset has been counted through all the transactions in the database, it is made solid and counting stops.
Extras
FP Growth Python Code

import pandas as pd
from mlxtend.frequent_patterns import fpgrowth
from mlxtend.frequent_patterns import association_rules

# Given transactions
data = [['A', 'B'],
        ['A', 'B', 'C'],
        ['A', 'B', 'C', 'E'],
        ['A', 'B', 'C', 'E'],
        ['A', 'B', 'D'],
        ['A', 'C'],
        ['A', 'C'],
        ['B', 'C'],
        ['B', 'D']]

# Create a list of all unique items
items = sorted(set(item for transaction in data for item in transaction))

# Convert transactions into a DataFrame with binary representation
df = pd.DataFrame([{item: (item in transaction) for item in items} for transaction in data])

# Mine the frequent itemsets (min_support assumed here; the slide shows only part of the result)
frequent_itemsets = fpgrowth(df, min_support=0.2, use_colnames=True)
print(frequent_itemsets)

Output:
     support  itemsets
0   0.777778  (B)
1   0.777778  (A)
2   0.666667  (C)
3   0.222222  (E)
4   0.222222  (D)
5   0.555556  (A, B)
6   0.555556  (A, C)
7   0.444444  (C, B)
8   0.333333  (A, C, B)
9   0.222222  (E, C)
10  0.222222  (A, E)
11  0.222222  (E, B)
12  0.222222  (A, E, C)
◼ Once both A and D are determined frequent, the counting of AD begins
◼ Once all length-2 subsets of BCD are determined frequent, the counting of BCD begins
◼ (Figure: itemset lattice over {A, B, C, D} – {}, 1-itemsets, 2-itemsets, 3-itemsets, ABCD – comparing when Apriori and DIC start counting itemsets as transactions are read.)

S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. SIGMOD '97.
Partitioning Algorithm Pseudocode