Module 4

Dr. L. C. MANIKANDAN
Professor, Department of CSE,
Universal Engineering College.
 Association Rules – Introduction
 Methods to discover Association rules
 Apriori (Level-wise algorithm)
 Partition Algorithm
 Pincer Search Algorithm
 Dynamic Itemset Counting Algorithm
 FP-tree Growth Algorithm

Association rule mining finds interesting associations and relationships
among large sets of data items.
 An association rule shows how frequently a set of items occurs together in transactions.
Example: Market Basket Analysis
 It allows retailers to identify relationships between the items that people
buy together frequently.
 This process analyzes customer buying habits by finding associations
between the different items that customers place in their “shopping
baskets”.
 The discovery of such associations can help retailers develop marketing
strategies by gaining insight into which items are frequently purchased
together by customers.
 Example: If customers are
buying milk, how likely are
they to also buy bread on the
same trip to the supermarket?
 Such information can lead to
increased sales by helping
retailers do selective
marketing and plan their shelf
space.

Frequent Itemset
 A set of items is referred to as an itemset.
 An itemset that contains k items is a k-itemset.
 The set {computer, antivirus_software} is a 2-itemset.
 A frequent itemset is a set of items that occur together frequently in a dataset.
 Ex1: In a supermarket environment, the items bread and butter are likely
to be purchased together by many customers.
 So, {bread, butter} is an example of a frequent itemset.
 The association between the items is represented by the association rule:
bread => butter
 Eg2: In an electronic store, customers who purchase computers also
tend to buy antivirus software at the same time.
 So, {computer, antivirus software} is an example of a frequent itemset.
 It is represented by the following association rule:
computer => antivirus software
Measures of Rule Interestingness:
 Support and Confidence are two measures of rule interestingness.
 Support reflects the usefulness of discovered association rules.
 Confidence reflects the certainty of discovered association rules.
 Association rules are considered interesting if they satisfy both a
minimum support threshold and a minimum confidence threshold.
 Such thresholds can be set by users or domain experts.
 Ex: Consider the following association rule:
computer => antivirus software [support = 2%, confidence = 60%]
 A support of 2% means that 2% of all the transactions under analysis show that
computer and antivirus software are purchased together.
 A confidence of 60% means that 60% of the customers who purchased a
computer also bought the software.
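In formula terms (following the standard definitions behind these percentages; this note is not on the original slide), for a rule A => B over a set of transactions:

support(A => B) = (number of transactions containing both A and B) / (total number of transactions)
confidence(A => B) = (number of transactions containing both A and B) / (number of transactions containing A)

So in the example, 2 out of every 100 transactions contain both the computer and the antivirus software, and of the transactions containing a computer, 60 out of 100 also contain the software.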

 Association rule mining can be viewed as a two-step process:


 Find all frequent itemsets: each of these itemsets must occur at least as frequently as a
predetermined minimum support count.
 Generate strong association rules from the frequent itemsets: these rules must satisfy
minimum support and minimum confidence.
 Apriori Algorithm is a fundamental method in association rule mining
 Primarily used to find frequent itemsets in large datasets.
 It follows a level-wise approach, where frequent itemsets are iteratively
expanded using the Apriori property.

Key Concept:
 If an itemset is frequent, then all its subsets must also be frequent
(Apriori Property).
 If an itemset is infrequent, then all its supersets must also be infrequent.
 Commonly used in: Market Basket Analysis, Recommendation Systems,
Fraud Detection.
Working of Apriori Algorithm
Step 1: Count Individual Item Frequencies
 Scan the database and count the occurrences of each 1-itemset.
 Remove infrequent items (i.e., those below the minimum support
threshold).
Step 2: Generate Candidate Itemsets (Ck)
 Use previous frequent itemsets (Lk-1) to generate new k-itemsets
(Ck).
 Only keep those whose subsets are frequent (Apriori Pruning).
Step 3: Compute Support & Prune Infrequent Itemsets
 Scan the database and count occurrences of candidate k-itemsets.
 Remove itemsets below the minimum support.
Step 4: Repeat Until No More Frequent Itemsets
 Continue generating larger itemsets until no new frequent
itemsets are found.
 Use these frequent itemsets to generate association rules
(Confidence & Lift).
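A minimal Python sketch of these four steps (not from the slides; the function name apriori, the use of an absolute transaction count for min_support, and all variable names are illustrative assumptions):

from itertools import combinations

def apriori(transactions, min_support):
    # Level-wise search for frequent itemsets (minimal illustrative sketch).
    transactions = [frozenset(t) for t in transactions]
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items}   # Step 1: candidate 1-itemsets
    frequent = {}
    k = 1
    while current:
        # Step 3: scan the database and count the candidate k-itemsets
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Step 2 (for the next level): join frequent k-itemsets into (k+1)-candidates,
        # keeping only those whose k-subsets are all frequent (Apriori pruning)
        k += 1
        keys = list(level)
        candidates = {a | b for a in keys for b in keys if len(a | b) == k}
        current = {c for c in candidates
                   if all(frozenset(s) in level for s in combinations(c, k - 1))}
    # Step 4: the loop stops once no new frequent itemsets are found
    return frequent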

Example: Dataset Example (Market Basket Transactions)
Transaction ID Items Purchased
T1 A, B, C
T2 A, C
T3 B, C, D
T4 A, B, D
T5 A, B, C, D

Step 1: Count 1-itemset Frequencies


 Support Count (Min Support = 2 transactions)
 A: 4, B: 4, C: 4, D: 3
 Frequent 1-itemsets: {A}, {B}, {C}, {D}
Step 2: Generate Candidate 2-itemsets & Prune
 Candidates: {A, B}, {A, C}, {A, D}, {B, C}, {B, D}, {C, D}
 Support counts: {A, B}: 3, {A, C}: 3, {A, D}: 2, {B, C}: 3, {B, D}: 3, {C, D}: 2
 Frequent 2-itemsets: all six candidates (each count meets the minimum support of 2)
Step 3: Generate Candidate 3-itemsets & Prune
 Candidates: {A, B, C}, {A, B, D}, {A, C, D}, {B, C, D}
 Frequent 3-itemsets: {A, B, C}: 2, {A, B, D}: 2, {B, C, D}: 2 ({A, C, D} occurs only once and is pruned)
Step 4: Generate Candidate 4-itemset
 {A, B, C, D} occurs only once and does not meet support → STOP.
 Frequent Itemsets Discovered: all 1-itemsets and 2-itemsets above, plus {A, B, C}, {A, B, D}, {B, C, D}
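For reference (not from the slides), feeding this dataset into the hypothetical apriori sketch shown earlier reproduces these counts:

transactions = [{'A', 'B', 'C'}, {'A', 'C'}, {'B', 'C', 'D'},
                {'A', 'B', 'D'}, {'A', 'B', 'C', 'D'}]
print(apriori(transactions, min_support=2))
# every 1-itemset and 2-itemset is frequent, plus
# {A, B, C}: 2, {A, B, D}: 2 and {B, C, D}: 2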

Association Rule Generation
 Once frequent itemsets are identified, we generate association rules using
Confidence & Lift.
 Example Rule: {A, B} → {C}
 Confidence = Support(A, B, C) / Support(A, B)
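Plugging in the counts from the worked example above (a check added here, not part of the slide):

Confidence({A, B} → {C}) = Support({A, B, C}) / Support({A, B}) = 2 / 3 ≈ 67%
Lift({A, B} → {C}) = Confidence({A, B} → {C}) / Support({C}) = (2/3) / (4/5) ≈ 0.83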

Advantages
 Easy to Understand & Implement
 Works Well for Small to Medium Datasets
 Finds Strong Association Rules

Disadvantages
 Multiple Database Scans → Slow for large datasets.
 Exponential Candidate Growth → High memory usage.
 Not Efficient for High-Dimensional Data
 The Pincer Search Algorithm is an improved approach to frequent itemset mining.
 Designed to find maximal frequent itemsets in large transactional databases.
 It optimizes the traditional Apriori algorithm by combining bottom-up
(support-based pruning) and top-down (maximal frequent itemset search)
approaches.

Why Pincer Search?


 Apriori Algorithm scans the database multiple times, making it inefficient for
large datasets.
 Pincer Search reduces unnecessary computations by maintaining both
frequent and maximal frequent itemsets simultaneously.
 This approach helps in early discovery of maximal frequent itemsets,
reducing database scans.
Working of Pincer Search Algorithm
 Pincer Search Algorithm integrates:
1. Bottom-up search: Similar to Apriori, it grows itemsets step by step.
2. Top-down search: It maintains a maximal frequent itemset (MFI) list,
pruning infrequent itemsets early.

Steps of Pincer Search


1. Generate candidate 1-itemsets and compute their support.
2. Identify frequent 1-itemsets and extend them to larger itemsets (similar to
Apriori).
3. Simultaneously perform a top-down search: Maintain a list of maximal
frequent itemsets (MFIs).
4. Prune infrequent itemsets early using MFI knowledge.
5. Refine candidate itemsets until no new frequent itemsets are found.
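A simplified Python sketch of this combined bottom-up/top-down search (illustrative only: the function name and the condensed MFCS bookkeeping are assumptions, not the full published algorithm):

from itertools import combinations

def pincer_search(transactions, min_support):
    # Simplified pincer sketch: bottom-up Apriori-style levels plus a top-down
    # set of large candidates (MFCS) that is split whenever an infrequent
    # itemset is discovered.
    transactions = [frozenset(t) for t in transactions]
    support = lambda s: sum(1 for t in transactions if s <= t)

    items = sorted({i for t in transactions for i in t})
    mfcs = {frozenset(items)}                   # top-down candidates (start: all items)
    mfs = set()                                 # maximal frequent itemsets found so far
    level = [frozenset([i]) for i in items]     # bottom-up candidates (1-itemsets)
    k = 1
    while level:
        # Top-down: any frequent MFCS member is a maximal frequent itemset.
        for cand in list(mfcs):
            if support(cand) >= min_support:
                mfcs.discard(cand)
                mfs.add(cand)
        # Bottom-up: subsets of a known MFI are frequent without counting them.
        frequent_k = [c for c in level
                      if any(c <= m for m in mfs) or support(c) >= min_support]
        infrequent_k = [c for c in level if c not in frequent_k]
        # Split every MFCS member that contains an infrequent itemset.
        for bad in infrequent_k:
            for cand in [m for m in mfcs if bad <= m]:
                mfcs.discard(cand)
                mfcs.update(sub for sub in (cand - {i} for i in bad) if sub)
        # Next bottom-up level with the usual join and subset pruning.
        k += 1
        freq_set = set(frequent_k)
        joined = {a | b for a in frequent_k for b in frequent_k if len(a | b) == k}
        level = [c for c in joined
                 if all(frozenset(s) in freq_set for s in combinations(c, k - 1))
                 and not any(c <= m for m in mfs)]
    # Any remaining frequent MFCS member is also maximal frequent.
    mfs.update(c for c in mfcs if support(c) >= min_support)
    return mfs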

Advantages:
 Faster than Apriori (Fewer database scans).
 Efficient for large datasets with long frequent itemsets.
 Reduces computational complexity by using both top-down and
bottom-up approaches.

Example dataset:
Transaction ID Items Purchased
T1 A, B, C
T2 A, B
T3 A, C
T4 B, C, D
T5 A, B, C, D

 Step 1: Identify frequent 1-itemsets:


{A}, {B}, {C}, {D} (if they meet the support threshold).
 Step 2: Generate 2-itemsets, check support.
 Step 3: While doing this, maintain a maximal frequent itemset list to prune
unpromising itemsets.
 Step 4: Continue until no new frequent itemsets can be generated.
 Partitioning Algorithm is an efficient method for frequent itemset mining.
 Specifically designed to improve performance over the Apriori
algorithm.
 It divides the database into partitions, processes each partition
separately, and then combines the results to find frequent itemsets.

Key Idea
 Instead of scanning the entire database multiple times, the algorithm first
identifies local frequent itemsets within each partition.
 And then merges them to find globally frequent itemsets.

Steps of the Partitioning Algorithm
Step 1: Divide the Database into Partitions
 The dataset is split into multiple partitions (subsets).
 Each partition is processed independently, reducing memory
overhead.

Step 2: Identify Local Frequent Itemsets


 Minimum support threshold (min_sup) is applied within each
partition.
 Itemsets that are frequent within a partition are potentially frequent
globally.

Step 3: Merge Local Frequent Itemsets
 A global candidate set is formed by combining frequent itemsets
from all partitions.
 The final global support count is calculated for each itemset across
the full dataset.

Step 4: Filter Final Frequent Itemsets


 The global frequent itemsets are determined by applying the original
min_sup to the merged counts.

 Optimization: Only the local frequent itemsets are checked in the final global scan, avoiding unnecessary computations.
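A short Python sketch of this two-scan structure (illustrative; it reuses the hypothetical apriori routine sketched earlier as the local miner, and min_support_ratio is assumed to be a fraction of the transactions):

import math

def partition_mining(transactions, min_support_ratio, num_partitions):
    # Scan 1: mine each partition locally; Scan 2: recount the merged candidates
    # over the whole database and keep only the globally frequent ones.
    n = len(transactions)
    size = math.ceil(n / num_partitions)
    partitions = [transactions[i:i + size] for i in range(0, n, size)]

    global_candidates = set()
    for part in partitions:
        local_min = math.ceil(min_support_ratio * len(part))
        global_candidates.update(apriori(part, local_min))   # local frequent itemsets

    data = [frozenset(t) for t in transactions]
    global_min = min_support_ratio * n
    counts = {c: sum(1 for t in data if c <= t) for c in global_candidates}
    return {c: cnt for c, cnt in counts.items() if cnt >= global_min}

Applied to the eight-transaction example that follows, partition_mining(dataset, 0.5, 2) would keep the single items plus {A, B}, {A, C} and {B, C}.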
Example: Dataset (Market Basket Transactions)
Transaction ID Items Purchased
T1 A, B, C
T2 A, C
T3 B, C, D
T4 A, B, D
T5 A, B, C, D
T6 B, C
T7 A, D
T8 A, B, C

Step 1: Partitioning the Database
 Assume two partitions:
 Partition 1: {T1, T2, T3, T4}
 Partition 2: {T5, T6, T7, T8}

Step 2: Identify Local Frequent Itemsets in Each Partition
 Applying min_sup = 50% (2 transactions per partition of 4)
 Partition 1 Frequent Itemsets (beyond the 1-itemsets): {A, B}, {A, C}, {B, C}, {B, D}
 Partition 2 Frequent Itemsets (beyond the 1-itemsets): {A, B}, {A, C}, {A, D}, {B, C}, {A, B, C}
Step 3: Merge Local Frequent Itemsets
 Global Candidate Itemsets: {A, B}, {A, C}, {A, D}, {B, C}, {B, D}, {A, B, C} (plus the 1-itemsets)

Step 4: Final Global Scan & Prune Non-Frequent Itemsets
 With a global min_sup of 4 transactions (50% of 8): {A, B}: 4, {A, C}: 4 and {B, C}: 5 are frequent; {A, D}: 3, {B, D}: 3 and {A, B, C}: 3 are pruned.
 Final Frequent Itemsets (beyond the 1-itemsets): {A, B}, {A, C}, {B, C}

 Only one additional full scan is needed, making the algorithm much faster than Apriori.

How Partitioning Algorithm Solves Apriori’s Disadvantages?

Problem in Apriori → Solution in Partitioning Algorithm
 Multiple database scans → Requires only two scans (partition-wise & final global scan)
 High computational cost → Processes each partition independently, reducing memory load
 Slow for large datasets → Breaks dataset into manageable chunks, increasing efficiency
 Too many candidate itemsets → Only local frequent itemsets are considered, reducing computations

Partitioning Algorithm is significantly more efficient than Apriori, especially for large datasets.
 The Dynamic Itemset Counting (DIC) Algorithm is an improved version of the Apriori algorithm used for frequent itemset mining.
 It aims to reduce the number of database scans by dynamically adding and
removing itemsets during the scanning process.

Key Concept:
 Instead of scanning the database multiple times (like Apriori), DIC
interleaves candidate generation and counting within a single database
pass.
 It dynamically starts counting new itemsets before previous iterations are
completed, making it faster than Apriori.
 Used in: Market Basket Analysis, Recommendation Systems, Web Mining.

Working of Dynamic Itemset Counting
Step 1: Partition the Database
 The database is divided into equal-sized partitions.
 Instead of waiting for a full database scan, new itemsets start being
counted midway in different partitions.

Step 2: Count Frequent 1-Itemsets


 The algorithm begins by scanning the first partition and identifying
frequent 1-itemsets.
 As more partitions are scanned, new itemsets are introduced
dynamically.
Step 3: Generate & Count Candidate Itemsets Dynamically
 Unlike Apriori, DIC does not wait for a full pass to generate new
itemsets.
 Itemsets are marked as frequent, infrequent, or uncertain based on
observed support.

Step 4: Prune Infrequent Itemsets & Repeat


 Once enough partitions are scanned, itemsets with low support are
eliminated.
 The process continues until no new frequent itemsets are found.
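A simplified Python sketch of this interleaved counting (illustrative only; the real algorithm's four itemset states are reduced here to "being counted" and "finished", and all names are assumptions):

from itertools import combinations

def dic(transactions, min_support, block_size):
    # Dynamic Itemset Counting sketch: each itemset is counted over one full pass,
    # but new candidate itemsets may start being counted at any block boundary.
    data = [frozenset(t) for t in transactions]
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    n_blocks = len(blocks)

    counting = {frozenset([i]): [0, 0] for t in data for i in t}   # itemset -> [count, blocks seen]
    finished = {}                                                  # itemsets counted over every block

    b = 0
    while counting:
        for itemset, state in counting.items():
            state[0] += sum(1 for t in blocks[b % n_blocks] if itemset <= t)
            state[1] += 1
        # Finish itemsets that have now been counted over all blocks.
        for itemset in [i for i, s in counting.items() if s[1] == n_blocks]:
            finished[itemset] = counting.pop(itemset)[0]
        # Block boundary: start counting supersets of itemsets that already look frequent.
        promising = [i for i, s in counting.items() if s[0] >= min_support] + \
                    [i for i, c in finished.items() if c >= min_support]
        for x in promising:
            for y in promising:
                cand = x | y
                if len(cand) == len(x) + 1 and cand not in counting and cand not in finished:
                    if all(frozenset(s) in counting or frozenset(s) in finished
                           for s in combinations(cand, len(cand) - 1)):
                        counting[cand] = [0, 0]
        b += 1
    return {i: c for i, c in finished.items() if c >= min_support}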

 Dataset Example (Market Basket Transactions)

Transaction ID Items Purchased
T1 A, B, C
T2 A, C
T3 B, C, D
T4 A, B, D
T5 A, B, C, D

Step 1: Partition the Database
 Split the dataset into partitions
 Partition 1: {T1, T2}
 Partition 2: {T3, T4}
 Partition 3: {T5}

Step 2: Start Counting Frequent 1-Itemsets
 After Partition 1, the items {A}, {B}, {C} are candidates.
 After Partition 2, {D} also appears frequently.
 Frequent itemsets start forming dynamically before all partitions have been scanned.

Step 3: Dynamically Count & Prune Itemsets (assuming a minimum support of 3)
 {A, B}, {B, C} reach the threshold support → Kept.
 {A, D} stays below the threshold (2 occurrences) → Pruned.

Step 4: Generate Association Rules
 Frequent itemsets {A, B}, {B, C}, {B, D} are used to generate association rules.
Advantages:
 Fewer Database Scans → Faster than Apriori.
 More Efficient Candidate Pruning → Reduces memory usage.
 Adaptable → Itemsets are counted dynamically.

Disadvantages:
 Complex Implementation → More difficult than Apriori.
 Requires Careful Partitioning → Poor partitioning may lead to inefficiencies.

 Frequent Pattern Tree (FP-Tree) Growth Algorithm is an efficient algorithm
used for frequent itemset mining in large datasets.
 It is an improvement over the Apriori algorithm: it avoids multiple database scans and candidate generation, making it faster and more scalable.

Key Concept:
 Uses a compact tree structure (FP-tree) to store frequent itemsets.
 Eliminates the need for candidate generation like in Apriori.
 Reduces database scans, improving efficiency for large datasets.
 Used in: Market Basket Analysis, Web Mining, Bioinformatics.
Working of FP-Tree Growth Algorithm
Step 1: Scan the Database & Find Frequent Items
 The dataset is scanned once to compute the support count of each item.
 Items below minimum support are removed.

Step 2: Construct the FP-Tree


 A tree-like structure is built, where each transaction shares common paths.
 Items are ordered based on frequency to create a compact structure.

Step 3: Extract Frequent Itemsets using Conditional Pattern Bases


 A conditional FP-tree is constructed for each frequent item.
 The frequent itemsets are recursively extracted without candidate
generation.
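A compact Python sketch of Steps 1 and 2 (illustrative; the class and function names are assumptions, and the recursive mining of Step 3 is only hinted at through the header table):

from collections import Counter, defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 1, {}

def build_fp_tree(transactions, min_support):
    # Step 1: one scan to count item supports and drop infrequent items.
    counts = Counter(item for t in transactions for item in t)
    frequent = {i for i, c in counts.items() if c >= min_support}

    # Step 2: insert each transaction with its items ordered by descending frequency,
    # so transactions with common prefixes share paths in the tree.
    root = FPNode(None, None)
    header = defaultdict(list)          # item -> its nodes, used later for conditional bases
    for t in transactions:
        ordered = sorted((i for i in t if i in frequent), key=lambda i: (-counts[i], i))
        node = root
        for item in ordered:
            if item in node.children:
                node.children[item].count += 1
            else:
                node.children[item] = FPNode(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
    return root, header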
 Dataset Example (Market Basket Transactions)

Transaction ID Items Purchased
T1 A, B, C
T2 A, C
T3 B, C, D
T4 A, B, D
T5 A, B, C, D

Step 1: Find Frequent Items (Support Count)


 Min Support = 2 Transactions
Item Support Count
A 4
B 4
C 4
D 3

 All items meet the minimum support threshold.


Step 2: Construct FP-Tree
 Transactions are inserted into the tree in descending frequency
order.
 Shared paths reduce memory usage.

Example FP-Tree Structure
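The slide's figure is not reproduced here; under a descending-frequency ordering A, B, C, D (ties broken alphabetically) and min support = 2, the dataset above would produce roughly this tree (item:count):

null (root)
  A:4
    B:3
      C:2
        D:1
      D:1
    C:1
  B:1
    C:1
      D:1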

Step 3: Extract Frequent Itemsets using Conditional FP-Trees
 Conditional FP-trees are built for each frequent item.
 Frequent itemsets are extracted recursively.
 Frequent Itemsets Found (min support = 2):
{A}, {B}, {C}, {D}, {A, B}, {A, C}, {A, D}, {B, C}, {B, D}, {C, D}, {A, B, C}, {A, B, D}, {B, C, D}
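For example (working from the tree sketched above; this detail is not spelled out on the slide), the conditional pattern base of D consists of its prefix paths {A, B, C}: 1, {A, B}: 1 and {B, C}: 1. Within this base B occurs 3 times while A and C occur 2 times each, which yields the D-conditional frequent itemsets {B, D}: 3, {A, D}: 2, {C, D}: 2, {A, B, D}: 2 and {B, C, D}: 2.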
