Data Mining Module 4 Important Topics PYQs
For more notes visit
https://rtpnotes.vercel.app
Data-Mining-Module-4-Important-Topics-PYQs
1. Describe any three methods to improve the efficiency of the Apriori Algorithm
Three Ways to Make Apriori Faster:
1. Hash-Based Technique
2. Transaction Reduction
3. Partitioning
2. Define support, confidence and frequent itemset in association rule mining context.
1. Support
2. Confidence
3. Frequent Itemset
3. Write about the bi-directional searching technique for pruning in pincer search algorithm
Let's Break it Down:
In Pincer Search:
How Bi-directional Pruning Works
From the Bottom (Bottom-Up):
From the Top (Top-Down):
Let’s Take a Real Example:
Now comes the key part (bi-directional pruning):
Why is this powerful?
Pincer Search Algorithm:
Steps of the Pincer Search Algorithm:
Step 1: Generate candidate 1-itemsets and compute their support
Step 2: Identify frequent 1-itemsets and extend them to larger itemsets
Step 3: Simultaneously perform a top-down search
Step 4: Prune infrequent itemsets early using MFI knowledge
Step 5: Refine candidate itemsets until no new frequent itemsets are found
1. Describe any three methods to improve the efficiency of the Apriori Algorithm
The Apriori algorithm works in steps (levels): at each level, it scans the entire dataset again, which can be very slow if the dataset is large.
Three Ways to Make Apriori Faster:
1. Hash-Based Technique
What it means:
Instead of checking every possible itemset, we use a hash table (a fast way of storing and
looking up data) to reduce how many candidate itemsets we need to consider.
How it helps:
When generating 2-item or 3-item sets, we can hash them into buckets. If a bucket has a low count (less than the minimum support), we skip all itemsets in that bucket, saving time.
🧠 New term: Hash table – A data structure that stores data using a key-value format and
allows quick searching.
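Below is a minimal Python sketch of this hash-based idea (sometimes called DHP). The basket data, the tiny bucket table, and the toy bucket() function are illustrative assumptions, not part of the notes: while scanning, every 2-item pair of a transaction is hashed into a bucket, and pairs whose bucket count cannot reach the minimum support are skipped as candidates.

from itertools import combinations

def bucket(pair, num_buckets=7):
    # toy, deterministic hash function just for illustration
    return sum(ord(ch) for item in pair for ch in item) % num_buckets

transactions = [["bread", "butter"], ["bread", "butter", "milk"], ["milk", "jam"]]
min_sup = 2
bucket_counts = [0] * 7

# while scanning, hash every 2-item pair of each transaction into its bucket
for t in transactions:
    for pair in combinations(sorted(t), 2):
        bucket_counts[bucket(pair)] += 1

# a pair can only be frequent if its bucket count reaches min_sup,
# so pairs that fall into low-count buckets are never generated as candidates
candidates = {pair
              for t in transactions
              for pair in combinations(sorted(t), 2)
              if bucket_counts[bucket(pair)] >= min_sup}
print(candidates)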
2. Transaction Reduction
What it means:
We reduce the number of transactions (shopping baskets) we look at in each round.
How it helps:
If a transaction doesn't contain any of the current frequent itemsets, it won’t help in future
rounds, so we can ignore it in the next scans.
🧠 Example:
If a shopping basket doesn’t contain any frequent 2-itemsets, it definitely won’t help find any
frequent 3-itemsets — so skip it!
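A small Python sketch of transaction reduction (the basket data and function name are illustrative): once the frequent 2-itemsets are known, any basket that contains none of them is dropped before the next scan, since it cannot contribute to any frequent 3-itemset.

def reduce_transactions(transactions, frequent_itemsets):
    kept = []
    for t in transactions:
        # keep the transaction only if it still contains at least one frequent itemset
        if any(set(itemset) <= set(t) for itemset in frequent_itemsets):
            kept.append(t)
    return kept

baskets = [["bread", "butter", "jam"], ["milk"], ["bread", "butter"]]
frequent_2_itemsets = [("bread", "butter")]
print(reduce_transactions(baskets, frequent_2_itemsets))   # the ["milk"] basket is dropped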
3. Partitioning
What it means:
We split the data into smaller parts (partitions), find frequent itemsets in each partition, and then
combine results.
How it helps:
Instead of scanning the whole dataset multiple times, we do just two scans:
1. Scan each partition separately and find its local frequent itemsets.
2. Combine all the local frequent itemsets into one candidate set and scan the full dataset once more to check which of them are globally frequent.
🧠 Why it works:
If an itemset is frequent in the full data, it must be frequent in at least one partition.
2. Define support, confidence and frequent itemset in association rule mining context.
1. Support
What it means:
Support shows how often an itemset appears, that is, the percentage of all transactions that contain it.
💡 In simple words:
If 2 out of 100 customers bought both a computer and antivirus software, the support for {computer, antivirus} is 2%.
2. Confidence
What it means:
Confidence shows how often the rule has been true — among the transactions that contain
the "if" part, how many also contain the "then" part.
💡 Example:
If 60 out of 100 people who bought a computer also bought antivirus software, then:
Confidence of the rule computer ⇒ antivirus = 60%
3. Frequent Itemset
What it means:
A frequent itemset is a group of items that often appear together in many transactions.
💡 Example:
In a store, if many people buy both bread and butter, then {bread, butter} is a
frequent itemset.
{bread} → 1-itemset, {bread, butter} → 2-itemset
Term | Meaning | Example
Support | % of transactions that contain the itemset | 2% of customers bought both computer and antivirus
Confidence | % of transactions with item A that also include item B | 60% of those who bought a computer also bought antivirus
Frequent Itemset | Group of items that appear together often | {bread, butter} is a frequent itemset
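The support and confidence definitions translate directly into code. This is a small helper sketch (illustrative names and toy baskets, not from the notes) that computes both measures exactly as defined above:

def support(transactions, itemset):
    # fraction of all transactions that contain every item in `itemset`
    hits = sum(1 for t in transactions if set(itemset) <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    # confidence(A => B) = support(A and B together) / support(A)
    return (support(transactions, set(antecedent) | set(consequent))
            / support(transactions, antecedent))

baskets = [{"computer", "antivirus"}, {"computer"}, {"computer", "antivirus"}, {"bread"}]
print(support(baskets, {"computer", "antivirus"}))        # 0.5
print(confidence(baskets, {"computer"}, {"antivirus"}))   # about 0.67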
3. Write about the bi-directional searching technique for pruning in pincer search algorithm
Let's Break it Down:
Bi-directional searching means searching from both directions at the same time: bottom-up from small itemsets and top-down from large candidate itemsets. This is the core idea of the Pincer Search algorithm and helps in pruning (removing) unpromising itemsets early.
In Pincer Search:
We do both:
Bottom-Up Search:
Like Apriori — build larger frequent itemsets step-by-step.
Top-Down Search:
At the same time, we try to guess the largest possible frequent itemsets (called
Maximal Frequent Itemsets or MFIs), and use them to cut down the number of itemsets
we need to check.
How Bi-directional Pruning Works
From the Bottom (Bottom-Up):
We check small itemsets and grow them (like {A} → {A,B} → {A,B,C}).
If a set is not frequent, we stop growing it.
➤ Example: If {A,B} is infrequent, we don’t check {A,B,C}.
From the Top (Top-Down):
We want to guess big itemsets (like {A, C, D, E}) that we think might be frequent, and then check their smaller parts (subsets) to confirm.
If any subset is not frequent, then we can stop checking that big itemset and its related ones.
Let's Take a Real Example:
Transaction ID | Items
T1 | A, C, D, E
T2 | A, C
T3 | A, D, E
T4 | C, D
T5 | A, C, D, E
We guess (from the data) that the set {A, C, D, E} might be a maximal frequent itemset (a
large group of items that appear together often).
So we put {A, C, D, E} into our MFI list (Maximal Frequent Itemset list).
Now comes the key part (bi-directional pruning):
Suppose the bottom-up search finds that {A, C} does not meet the minimum support, so it is not frequent.
Then we can say:
"If {A, C} itself is not frequent, then the bigger set {A, C, D, E} definitely won't be frequent either."
So we can prune (remove) {A, C, D, E} and save time, with no need to scan the whole database for it.
Why is this powerful?
Because:
Top-down pruning helps avoid checking large itemsets that will eventually fail.
Bottom-up pruning helps avoid growing itemsets from infrequent smaller sets.
When both work together, we save time and reduce database scans.
Pincer Search Algorithm:
The Pincer Search algorithm is an efficient approach for mining frequent itemsets in large datasets. It combines both bottom-up (support-based pruning) and top-down (maximal frequent itemset search) approaches.
Steps of the Pincer Search Algorithm:
Step 1: Generate candidate 1-itemsets and compute their support
Begin by generating all candidate 1-itemsets (individual items) from the dataset.
Compute their support, which means how frequently each item appears in the transactions.
Step 2: Identify frequent 1-itemsets and extend them to larger itemsets
From the candidate 1-itemsets, identify the frequent 1-itemsets (those that meet the minimum support threshold).
Use the frequent 1-itemsets to generate candidate 2-itemsets (pairs of items) and extend this process to generate larger itemsets.
Maintain a list of Maximal Frequent Itemsets (MFIs): these are itemsets that are frequent and cannot be extended further.
Step 3: Simultaneously perform a top-down search
For each iteration, perform a top-down search to find potential large frequent itemsets.
The idea is to try to guess large itemsets that might be frequent and check their smaller
subsets.
Step 4: Prune infrequent itemsets early using MFI knowledge
Use the knowledge of Maximal Frequent Itemsets (MFIs) to prune infrequent itemsets early.
If any subset of an itemset is infrequent (and we know that larger sets containing that
subset won’t be frequent), eliminate that itemset and its supersets from further
consideration.
Example: If {A, C} is not frequent, stop checking larger sets like {A, C, D, E} .
Step 5: Refine candidate itemsets until no new frequent itemsets are found
Continue refining your candidate itemsets by extending them to larger ones and checking
their support.
Stop when no new frequent itemsets can be found after pruning the non-promising
itemsets.
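The two pruning directions can be sketched in a few lines of Python. This is only an illustration of the pruning checks (the function names and toy sets are assumptions, not the full Pincer Search algorithm): a known infrequent subset removes a candidate maximal itemset, and a confirmed MFI removes bottom-up candidates it already covers.

def prune_mfi_candidates(mfi_candidates, known_infrequent):
    kept = []
    for cand in mfi_candidates:
        # top-down pruning: any known infrequent subset kills the whole candidate
        if any(bad <= cand for bad in known_infrequent):
            continue
        kept.append(cand)
    return kept

def prune_bottom_up(candidates, confirmed_mfis):
    # bottom-up side: itemsets covered by a confirmed MFI are already known to be
    # frequent, so they no longer need to be counted
    return [c for c in candidates if not any(c <= mfi for mfi in confirmed_mfis)]

mfi_candidates = [frozenset("ACDE")]
known_infrequent = [frozenset("AC")]          # found infrequent during the bottom-up pass
print(prune_mfi_candidates(mfi_candidates, known_infrequent))   # [] -> {A,C,D,E} is pruned

confirmed_mfis = [frozenset("ADE")]
candidates = [frozenset("AD"), frozenset("AB")]
print(prune_bottom_up(candidates, confirmed_mfis))              # only {A,B} still needs counting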
Market Basket Analysis with the Apriori Algorithm
In market basket analysis, association rule mining plays a key role by helping retailers understand which products are commonly bought together. This insight can help in creating strategies to boost sales, optimize store layouts, and personalize marketing efforts.
To solve this problem using the Apriori Algorithm, we will follow these steps:
Step 1: Find frequent 1-itemsets
We first count the occurrence of each item in the dataset to see which items meet the minimum support threshold (60%).
Total transactions = 6
Minimum Support = 60% of 6 = 3.6, which rounds up to 4 transactions.
Frequent 1-itemsets:
Step 2: Generate candidate 2-itemsets
Next, we generate candidate 2-itemsets by pairing frequent 1-itemsets and check if their support meets the minimum threshold.
Candidate 2-itemsets:
Frequent 2-itemsets:
Step 3: Generate candidate 3-itemsets
Now, we generate candidate 3-itemsets by pairing frequent 2-itemsets and check if their support meets the minimum threshold.
Candidate 3-itemsets:
Frequent 3-itemsets:
Step 4: Generate strong association rules
At this point, we have identified the frequent itemsets from Step 1, Step 2, and Step 3. Now we will generate strong association rules from those frequent itemsets.
We need to generate rules only from itemsets that have at least 2 items. For each frequent
itemset, we create rules and calculate their confidence to see if they meet the minimum
confidence threshold.
What is Confidence?
Confidence measures how likely it is that if an item is bought, another item will be bought as
well.
For example, the rule l1 ⇒ l2 means "if l1 is bought, then l2 is likely to be bought."
The formula for confidence is:
Confidence(A ⇒ B) = Support(A, B) ÷ Support(A)
Where:
Support(A, B) is the number of transactions where both items (A and B) appear together.
Support(A) is the number of transactions where item A appears.
If the confidence is greater than or equal to the minimum confidence threshold (in this case
80%), we will consider the rule strong.
The frequent 2-itemsets used for rule generation are:
{l1, l2}
{l2, l3}
We will generate the two possible rules for each of these itemsets.
Rules from {l1, l2}:
1. l1 ⇒ l2
Support(l1, l2) = 4 (Both l1 and l2 appear together in 4 transactions)
Support(l1) = 4 (l1 appears in 4 transactions)
Confidence(l1 ⇒ l2) = 4 ÷ 4 = 1.0 (100%)
Since Confidence = 100%, which is greater than the minimum confidence (80%), this
rule is strong.
2. l2 ⇒ l1
Support(l1, l2) = 4 (Both l1 and l2 appear together in 4 transactions)
Support(l2) = 5 (l2 appears in 5 transactions)
Confidence(l2 ⇒ l1) = 4 ÷ 5 = 0.8 (80%)
Since Confidence = 80%, which is equal to the minimum confidence (80%), this rule is
also strong.
Rules from {l2, l3}:
1. l2 ⇒ l3
Support(l2, l3) = 4 (Both l2 and l3 appear together in 4 transactions)
Support(l2) = 5 (l2 appears in 5 transactions)
Confidence(l2 ⇒ l3) = 4 ÷ 5 = 0.8 (80%)
Since Confidence = 80%, which meets the minimum confidence (80%), this rule is
strong.
2. l3 ⇒ l2
Support(l2, l3) = 4 (Both l2 and l3 appear together in 4 transactions)
Support(l3) = 4 (l3 appears in 4 transactions)
Confidence(l3 ⇒ l2) = 4 ÷ 4 = 1.0 (100%)
Since Confidence = 100%, which is greater than the minimum confidence (80%), this rule is strong.
From the above calculations, we have the following strong association rules (rules that meet
the minimum confidence of 80%):
1. l1 ⇒ l2 (Confidence = 100%)
2. l2 ⇒ l1 (Confidence = 80%)
3. l2 ⇒ l3 (Confidence = 80%)
4. l3 ⇒ l2 (Confidence = 100%)
Summary of Results:
The frequent itemsets were identified in Steps 1 to 3, and the four rules above all meet the minimum confidence of 80%. This concludes the process of finding frequent itemsets and generating strong association rules using the Apriori algorithm.
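As a quick sanity check, the four rules can be verified in a few lines of Python using the support counts stated in the worked example (the dictionary below simply hard-codes those counts):

support_counts = {
    frozenset(["l1"]): 4, frozenset(["l2"]): 5, frozenset(["l3"]): 4,
    frozenset(["l1", "l2"]): 4, frozenset(["l2", "l3"]): 4,
}
rules = [(["l1"], ["l2"]), (["l2"], ["l1"]), (["l2"], ["l3"]), (["l3"], ["l2"])]
min_conf = 0.8

for antecedent, consequent in rules:
    both = frozenset(antecedent) | frozenset(consequent)
    conf = support_counts[both] / support_counts[frozenset(antecedent)]
    status = "strong" if conf >= min_conf else "weak"
    print(antecedent, "=>", consequent, f"confidence = {conf:.0%} ({status})")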
Dynamic Itemset Counting (DIC)
Instead of scanning the database over and over again like Apriori, DIC (Dynamic Itemset Counting) does its counting on the go, while it is already scanning. It starts checking new combinations even before the current pass is finished.
Example to Understand
Instead of looking at the whole database at once, DIC splits it into chunks called partitions. This lets it process data bit by bit. Suppose the database contains these five transactions:
T1: A, B, C
T2: A, C
T3: B, C, D
T4: A, B, D
T5: A, B, C, D
Let’s divide the data into 3 partitions:
Partition 1 → T1, T2
Partition 2 → T3, T4
Partition 3 → T5
T3: B, C, D
T4: A, B, D
The running counts of the single items are updated on the go, for example:
A → now 4 times
B → now 4 times
C → now 3 times
D → now 2 times
Now we look at pairs (2-itemsets) like:
{A, B} → Appears in T1, T4, T5 → 3 times → ✅ frequent
{B, C} → Appears in T1, T3, T5 → 3 times → ✅ frequent
{A, D} → Only appears in T4, T5 → 2 times → ❌ not frequent → Prune it (i.e., remove it)
So even before finishing all partitions, we already know some combinations are frequent or not.
That’s why DIC is faster.
At the end of the scan, the frequent 2-itemsets are:
{A, B}
{B, C}
{B, D}
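A heavily simplified Python sketch of the DIC idea follows (this is not the full algorithm; the block split, names, and promotion rule are illustrative assumptions). The database is read block by block, and as soon as the running counts of single items reach min_sup, their 2-item combinations start being counted immediately in the remaining blocks instead of waiting for a separate full pass; the real algorithm also finishes counting them over the earlier blocks.

from collections import Counter
from itertools import combinations

transactions = [
    {"A", "B", "C"}, {"A", "C"},        # partition 1
    {"B", "C", "D"}, {"A", "B", "D"},   # partition 2
    {"A", "B", "C", "D"},               # partition 3
]
blocks = [transactions[0:2], transactions[2:4], transactions[4:5]]
min_sup = 3

item_counts = Counter()
pair_counts = Counter()
active_pairs = set()   # 2-itemsets whose counting has been started "on the go"

for block in blocks:
    for t in block:
        for item in t:
            item_counts[item] += 1
        for pair in combinations(sorted(t), 2):
            if pair in active_pairs:
                pair_counts[pair] += 1
    # after each block: items that already look frequent trigger counting of the
    # pairs built from them in the blocks that are still to come
    frequent_so_far = [i for i, c in item_counts.items() if c >= min_sup]
    for pair in combinations(sorted(frequent_so_far), 2):
        active_pairs.add(pair)

print("running item counts:", dict(item_counts))
print("pair counts started mid-scan:", dict(pair_counts))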
The Partition Algorithm
The Partition algorithm is a smarter way to find frequent itemsets (popular combinations of items) from large datasets like supermarket purchases. It is similar to Apriori, but much faster and more efficient.
Imagine you work in a supermarket chain with 8 stores. Instead of analyzing all sales from all
stores at once (which is slow), you first analyze each store separately, then combine the
results to find popular products sold across all stores.
Key Concepts
Itemset: A group of items (e.g., {A, B} means A and B are bought together).
Frequent Itemset: An itemset that appears often enough (based on a threshold).
Support: Number of times an itemset appears in transactions.
min_sup (minimum support): The minimum number of times an itemset must appear to be considered frequent.
Example Transactions
T1: A, B, C
T2: A, C
T3: B, C, D
T4: A, B, D
T5: A, B, C, D
Counting the single items over all transactions gives:
A – 4 times
B – 4 times
C – 4 times
D – 3 times
The database is then split into partitions (Partition 1 Transactions and Partition 2 Transactions). The same process is applied inside each partition to find its local frequent 1-itemsets, 2-itemsets and 3-itemsets, and the union of these local results forms the candidate set that is checked in one final scan over the whole database.
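A minimal Python sketch of the two-scan Partition idea (itemsets only up to size 2, and the partition split and thresholds below are illustrative assumptions): scan 1 finds the locally frequent itemsets of each partition, their union becomes the global candidate set, and scan 2 counts only those candidates over the whole database.

from collections import Counter
from itertools import combinations

def local_frequent(transactions, min_sup):
    counts = Counter()
    for t in transactions:
        for k in (1, 2):
            for itemset in combinations(sorted(t), k):
                counts[itemset] += 1
    return {s for s, c in counts.items() if c >= min_sup}

database = [
    {"A", "B", "C"}, {"A", "C"}, {"B", "C", "D"}, {"A", "B", "D"}, {"A", "B", "C", "D"},
]
partitions = [database[:3], database[3:]]

# Scan 1: union of the locally frequent itemsets = global candidate set
candidates = set()
for part in partitions:
    # local threshold scaled to the partition size (60% here, as an assumption)
    candidates |= local_frequent(part, min_sup=max(1, round(0.6 * len(part))))

# Scan 2: count only the candidates over the whole database
global_counts = Counter()
for t in database:
    for cand in candidates:
        if set(cand) <= t:
            global_counts[cand] += 1

global_min_sup = 3   # e.g. 60% of 5 transactions
print({c: n for c, n in global_counts.items() if n >= global_min_sup})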
FP-Growth Algorithm
Advantages of FP-Growth
Feature | Benefit
No candidate generation | Saves time and memory
Fewer database scans | Just 2 passes (once for frequency, once for tree)
Space-efficient | Uses compact FP-tree with shared paths
Fast on large datasets | Works much better than Apriori on large data
Example:
TID | Items
T1 | {f, a, c, d, m, p}
T2 | {a, b, c, f, m}
T3 | {b, f, j}
T4 | {b, c, k, p}
T5 | {a, f, c, e, p, m}
T6 | {f, a, c, d, m, p}
With minimum support = 3, the frequent items and their counts are f:5, c:5, a:4, m:4, p:4, b:3. Each transaction keeps only these items, sorted in this order (f, c, a, m, p, b), and is inserted into the FP-tree:
Root
├── f:5
│   ├── c:4
│   │   └── a:4
│   │       └── m:4
│   │           ├── p:3
│   │           └── b:1
│   └── b:1
└── c:1
    └── p:1
        └── b:1
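A compact FP-tree construction sketch in Python (illustrative code under the assumptions above, not the notes' implementation): it keeps only items with support of at least 3, reorders each transaction by descending item frequency, inserts it into a shared-prefix tree with a header table, and then reads the conditional pattern base for p off the tree, which is what the mining steps below do by hand.

from collections import Counter, defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}            # item -> FPNode

def build_fp_tree(transactions, min_sup):
    freq = Counter(i for t in transactions for i in t)
    freq = {i: c for i, c in freq.items() if c >= min_sup}    # drop infrequent items
    order = sorted(freq, key=lambda i: -freq[i])              # f, c, a, m, p, b here
    root = FPNode(None, None)
    header = defaultdict(list)                                # item -> its nodes in the tree
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in freq), key=order.index):
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += 1
            node = child
    return root, header

def show(node, depth=0):
    # print the tree with indentation
    for child in node.children.values():
        print("  " * depth + f"{child.item}:{child.count}")
        show(child, depth + 1)

def conditional_pattern_base(header, item):
    # walk from every node of `item` up to the root, collecting prefix paths
    base = []
    for node in header[item]:
        path, parent = [], node.parent
        while parent is not None and parent.item is not None:
            path.append(parent.item)
            parent = parent.parent
        if path:
            base.append((list(reversed(path)), node.count))
    return base

transactions = [
    ["f", "a", "c", "d", "m", "p"], ["a", "b", "c", "f", "m"], ["b", "f", "j"],
    ["b", "c", "k", "p"], ["a", "f", "c", "e", "p", "m"], ["f", "a", "c", "d", "m", "p"],
]
root, header = build_fp_tree(transactions, min_sup=3)
show(root)                                    # f:5 -> c:4 -> a:4 -> m:4 -> p:3, ...
print(conditional_pattern_base(header, "p"))  # [(['f', 'c', 'a', 'm'], 3), (['c'], 1)]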
Start with b:
Paths with b :
f, c, a, m → b (from T2)
f → b (from T3)
c → p → b (from T4)
Create conditional pattern base for b:
{f, c, a, m}: 1
{f}: 1
{c, p}: 1
From this pattern base, the counts of items occurring together with b are:
{b, f} : 2
{b, c} : 2
Both are below the minimum support of 3, so no larger frequent itemsets are generated from b.
Next: p
Paths with p:
f, c, a, m → p (from T1, T5, T6)
c → p (from T4)
Conditional pattern base for p:
{f, c, a, m}: 3
{c}: 1
Itemsets with p:
{p} : 4
{p, f} : 3
{p, c} : 4
{p, a} : 3
{p, m} : 3
{p, c, a} : 3
{p, f, c} : 3
{p, f, a} : 3
{p, f, m} : 3
{p, f, c, a} : 3
⚡ Repeat for m, a, c, f…
(We'll get many more combinations.)
Final Frequent Itemsets (minimum support = 3):
{f} (5)
{c} (5)
{a} (4)
{m} (4)
{p} (4)
{b} (3)
{f, c} (4)
{f, a} (4)
{f, m} (4)
{f, p} (3)
{c, a} (4)
{c, p} (4)
{a, m} (4)
{a, p} (3)
{m, p} (3)
{f, c, a} (4)
{f, c, m} (4)
{f, a, m} (4)
{c, a, m} (4)
{f, c, a, m} (4)
{p, f, c, a} (3)
Not frequent (below the minimum support of 3):
{b, f} (2) ❌
{b, c} (2) ❌
"If an itemset is frequent, then all of its subsets must also be frequent."
In simple words:
If {A, B, C} is a frequent itemset, then {A} , {B} , {C} , {A, B} , {A, C} , and {B, C}
must also be frequent.
So, if any subset is not frequent, we can skip checking its supersets. This saves time
by reducing the number of candidate itemsets.
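The pruning check itself is tiny. Here is an illustrative Python snippet (the function name is an assumption) that keeps a candidate only if every (k-1)-subset of it is already frequent, using the L2 computed in the example below:

from itertools import combinations

def has_infrequent_subset(candidate, frequent_prev):
    return any(frozenset(s) not in frequent_prev
               for s in combinations(candidate, len(candidate) - 1))

frequent_2 = {frozenset(p) for p in [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D")]}
print(has_infrequent_subset(("A", "B", "C"), frequent_2))   # False -> keep {A, B, C}
print(has_infrequent_subset(("B", "C", "D"), frequent_2))   # True  -> prune it ({C, D} is missing)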
Now let's apply the Apriori Algorithm step by step to the data, with min support = 2.
TID | Items
T1 | A, B
T2 | B, D
T3 | B, C
T4 | A, B, D
T5 | A, C
T6 | B, C
T7 | A, C
T8 | A, B, C, E
T9 | A, B, C
Step 1: Count each item
A → 6 ✅
B → 7 ✅
C → 6 ✅
D → 2 ✅
E → 1 ❌ (Remove)
L1 = {A}, {B}, {C}, {D}
Step 2: Generate candidate 2-itemsets from L1
{A, B}, {A, C}, {A, D}, {B, C}, {B, D}, {C, D}
Count support:
{A, B} → 4 ✅
{A, C} → 4 ✅
{A, D} → 1 ❌
{B, C} → 4 ✅
{B, D} → 2 ✅
{C, D} → 0 ❌
L2 = {A, B}, {A, C}, {B, C}, {B, D}
Step 3: Generate candidate 3-itemsets from L2
{A, B, C}
Count support:
{A, B, C} → 2 ✅
L3 = {A, B, C}
Final Frequent Itemsets:
L1 = {A}, {B}, {C}, {D}
L2 = {A, B}, {A, C}, {B, C}, {B, D}
L3 = {A, B, C}
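For completeness, here is a compact Apriori sketch in Python (illustrative code, not from the notes) run on the same nine transactions with min support = 2; it reproduces the L1, L2 and L3 computed above.

from itertools import combinations
from collections import Counter

def apriori(transactions, min_sup):
    transactions = [frozenset(t) for t in transactions]
    # L1: frequent single items
    counts = Counter(i for t in transactions for i in t)
    current = {frozenset([i]) for i, c in counts.items() if c >= min_sup}
    all_frequent = {s: counts[next(iter(s))] for s in current}
    k = 2
    while current:
        # candidate generation: join the frequent (k-1)-itemsets
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Apriori-property pruning: every (k-1)-subset must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, k - 1))}
        counts = Counter()
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        current = {c for c, n in counts.items() if n >= min_sup}
        all_frequent.update({c: counts[c] for c in current})
        k += 1
    return all_frequent

data = [
    {"A", "B"}, {"B", "D"}, {"B", "C"}, {"A", "B", "D"}, {"A", "C"},
    {"B", "C"}, {"A", "C"}, {"A", "B", "C", "E"}, {"A", "B", "C"},
]
for itemset, count in sorted(apriori(data, 2).items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(set(itemset), count)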