Apriori Algorithm Overview: Prior Knowledge and the Apriori Property
The Apriori Algorithm uses prior knowledge of itemset properties to mine frequent itemsets for
Boolean association rules. It leverages the Apriori property, which states that all non-empty
subsets of a frequent itemset must also be frequent; equivalently, if an itemset is infrequent, so is
every superset of it.
This property significantly reduces the search space during candidate generation.
Detailed Algorithm
1. First Iteration:
○ Compute the frequent 1-itemsets (L1) by counting the occurrences of each
item in the transactions.
○ Keep the items satisfying the minimum support threshold.
2. Subsequent Iterations:
○ Join: Generate candidate k-itemsets (Ck) by joining Lk-1 with itself. This involves
merging (k-1)-itemsets sharing the first k-2 items.
○ Prune: Eliminate candidates in Ck that have a (k-1)-subset which is not frequent.
3. Database Scan:
○ Count the occurrences of the candidate k-itemsets in Ck.
○ Retain in Lk only those meeting the minimum support.
4. Termination:
○ Repeat until no further frequent itemsets (Lk) are generated.
Pseudocode:
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    Lk+1 = candidates in Ck+1 with support ≥ min_sup (counted in one scan of the database);
end
return ∪k Lk;
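To make the loop concrete, here is a minimal Python sketch of Apriori, assuming transactions are given as sets of items and min_sup is an absolute count; the name apriori and the brute-force join are illustrative simplifications, not part of the original pseudocode.

from itertools import combinations

def apriori(transactions, min_sup):
    # L1: count single items and keep those meeting min_sup.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Lk = {s for s, c in counts.items() if c >= min_sup}
    frequent, k = set(Lk), 2
    while Lk:
        # Join + prune: a k-itemset is a candidate only if all its (k-1)-subsets are frequent.
        # (Brute-force join for brevity; the classic join merges itemsets sharing their first k-2 items.)
        Ck = {a | b for a in Lk for b in Lk if len(a | b) == k
              and all(frozenset(s) in Lk for s in combinations(a | b, k - 1))}
        # Database scan: count candidate occurrences and keep the frequent ones.
        Lk = {c for c in Ck if sum(1 for t in transactions if c <= t) >= min_sup}
        frequent |= Lk
        k += 1
    return frequent

For example, apriori([{'a','b'}, {'a','c'}, {'a','b','c'}], min_sup=2) returns the frequent itemsets {a}, {b}, {c}, {a,b}, and {a,c}.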
Improving the Efficiency of Apriori
Key Concepts
2. Transaction Reduction
● Purpose: Minimize the number of transactions scanned in subsequent iterations.
● Method:
○ Transactions without frequent k-itemsets are ignored in future scans for k+1
itemsets.
● Benefit: Reduces the computational overhead of future database scans (see the transaction-reduction sketch after this list).
3. Partitioning
● Purpose: Find frequent itemsets efficiently using just two database scans.
● Method:
○ Phase I: Divide the database into non-overlapping partitions and find local
frequent itemsets in each.
○ Phase II: Merge local frequent itemsets to form global candidates and verify their
support in the database.
● Benefit: Ensures efficient use of memory, as partitions are small enough to fit into main
memory (see the partitioning sketch after this list).
4. Sampling
● Purpose: Improve efficiency by analyzing a subset of the database.
● Method:
○ Randomly sample a portion of the database.
○ Identify frequent itemsets in the sample using a lower support threshold.
○ Validate itemsets against the entire database.
● Benefit: Reduces the computational cost of processing the entire dataset at once, at some
cost in accuracy (see the sampling sketch after this list).
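A minimal sketch of the transaction-reduction step, assuming transactions are Python sets and Lk is the set of frequent k-itemsets from the current pass; the helper name reduce_transactions is illustrative.

def reduce_transactions(transactions, Lk):
    # A transaction that contains no frequent k-itemset cannot contain any
    # frequent (k+1)-itemset, so it can be dropped from later scans.
    return [t for t in transactions if any(itemset <= t for itemset in Lk)]

# Hypothetical use inside the Apriori loop, right after Lk has been computed:
# transactions = reduce_transactions(transactions, Lk)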
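A sketch of the two-phase partitioning idea, assuming a relative support threshold and a mine(partition, local_min_sup) helper (for example, the apriori sketch above with an absolute count) that returns the itemsets frequent within a partition; all names here are illustrative.

def partitioned_frequent_itemsets(transactions, min_sup_ratio, n_parts, mine):
    size = (len(transactions) + n_parts - 1) // n_parts
    parts = [transactions[i:i + size] for i in range(0, len(transactions), size)]
    # Phase I: any globally frequent itemset must be locally frequent in at least
    # one partition, so the union of local frequent itemsets is a complete candidate set.
    candidates = set()
    for part in parts:
        candidates |= mine(part, max(1, int(min_sup_ratio * len(part))))
    # Phase II: one scan of the full database verifies the actual support of each candidate.
    global_min_sup = min_sup_ratio * len(transactions)
    return {c for c in candidates
            if sum(1 for t in transactions if c <= t) >= global_min_sup}

The completeness argument is the key design point: if an itemset fell below the local threshold in every partition, its total count would also fall below the global threshold.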
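A sketch of the sampling approach, again assuming a mine(sample, local_min_sup) helper; lowering the sample threshold (the factor 0.8 here is an arbitrary illustration) reduces the risk of missing globally frequent itemsets, and the final scan removes false positives.

import random

def sample_then_verify(transactions, min_sup_ratio, sample_frac, mine):
    sample = random.sample(transactions, max(1, int(sample_frac * len(transactions))))
    # Mine the sample with a lowered threshold to limit missed itemsets.
    candidates = mine(sample, max(1, int(0.8 * min_sup_ratio * len(sample))))
    # Verify the candidates with one scan of the entire database.
    global_min_sup = min_sup_ratio * len(transactions)
    return {c for c in candidates
            if sum(1 for t in transactions if c <= t) >= global_min_sup}

As noted above, this trades accuracy for speed: itemsets that are frequent overall but not frequent in the sample can still be missed.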
Multiway Array Aggregation
● Multiway Array Aggregation for Full Cube Computation is a method used in MOLAP
systems.
● This technique divides a multidimensional data array into smaller chunks to fit within
memory and improves memory access by optimizing the order in which cells are visited.
● The aggregation is done simultaneously across multiple dimensions to reduce
unnecessary revisits and computational costs.
● It employs compression techniques to manage sparse arrays and is effective when
memory space is sufficient, but becomes less feasible with high-dimensional or sparse
data sets.
1. Chunking: The data array is partitioned into smaller, memory-sized chunks, which are
then processed.
2. Simultaneous Aggregation: The aggregation is done on multiple dimensions
simultaneously during a chunk scan.
3. Efficient Ordering: The chunks are processed in an optimal order to reduce memory
usage and computational overhead.
● Method: The planes should be sorted and computed according to their size in ascending
order.
● Idea: Keep the smallest plane in main memory, and fetch and compute only one chunk at
a time for the largest plane.
● This method works well for moderate-dimensional arrays that aren't highly sparse,
providing faster performance than traditional ROLAP systems
● Limitation of the method: it performs well only for a small number of dimensions.
● If there are a large number of dimensions, “top-down” computation and iceberg cube
computation methods can be explored
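As a toy illustration of "simultaneous aggregation per chunk", the NumPy sketch below computes all three 2-D plane aggregates of a dense 3-D array in a single pass over memory-sized chunks; it omits the plane-ordering and sparse-array compression aspects described above, and all names are illustrative.

import numpy as np

def multiway_2d_aggregates(cube, chunk=16):
    a, b, c = cube.shape
    ab = np.zeros((a, b))   # aggregate over dimension C
    ac = np.zeros((a, c))   # aggregate over dimension B
    bc = np.zeros((b, c))   # aggregate over dimension A
    for i in range(0, a, chunk):
        for j in range(0, b, chunk):
            for k in range(0, c, chunk):
                block = cube[i:i+chunk, j:j+chunk, k:k+chunk]
                # Each chunk is read once, and all three planes are updated from it,
                # avoiding revisits of the same cells.
                ab[i:i+chunk, j:j+chunk] += block.sum(axis=2)
                ac[i:i+chunk, k:k+chunk] += block.sum(axis=1)
                bc[j:j+chunk, k:k+chunk] += block.sum(axis=0)
    return ab, ac, bc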
BUC:
● The BUC (Bottom-Up Construction) algorithm is used for computing sparse and
iceberg cubes in data warehousing.
● It constructs the cube from the apex cuboid (most aggregated level) down to the base
cuboid (least aggregated level).
● BUC uses partitioning and pruning techniques to efficiently compute the cube by
recursively aggregating data.
Key points:
● Top-Down Exploration: The algorithm starts at the apex cuboid and works towards the
base cuboid, partitioning the data on one dimension at a time (the name "Bottom-Up
Construction" comes from viewing the cuboid lattice with the apex at the bottom).
● Apriori Property: It prunes partitions that do not meet the minimum support threshold,
avoiding unnecessary calculations.
● Partitioning: At each level, the data is partitioned based on the dimension values, and
the algorithm recurses to the next level if the partition meets the support condition.
● Optimization: If a partition contains only one tuple, the result is directly written to avoid
unnecessary partitioning costs.
● Counting Sort: A linear-time sorting technique is used to speed up partitioning.
The BUC method helps in efficient iceberg cube computation by reducing the amount of work
done through pruning and partitioning, especially when dealing with large datasets.
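The recursion below is a minimal Python sketch of BUC's partition-and-prune structure for a COUNT-based iceberg cube; it uses dictionary partitioning instead of counting sort and keeps results in memory, so it only illustrates the control flow, and all names are illustrative.

def buc(tuples, dims, min_sup, prefix=(), out=None):
    # tuples: list of tuples of dimension values; dims: indices of the dimensions
    # still to be expanded; prefix: the (dimension, value) pairs fixed so far.
    if out is None:
        out = {}
    out[prefix] = len(tuples)               # aggregate (count) for the current group-by
    for pos, d in enumerate(dims):
        parts = {}                          # partition the current group on dimension d
        for t in tuples:
            parts.setdefault(t[d], []).append(t)
        for value, part in parts.items():
            # Apriori-style pruning: partitions below the iceberg threshold are skipped,
            # so none of their descendant group-bys are ever computed.
            if len(part) >= min_sup:
                buc(part, dims[pos + 1:], min_sup, prefix + ((d, value),), out)
    return out

For example, buc(rows, dims=(0, 1, 2), min_sup=2) returns a dictionary mapping each qualifying (dimension, value) prefix, starting from the empty apex prefix, to its count.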
Star-Cubing algorithm
● The Star-Cubing algorithm is designed to compute iceberg cubes efficiently by
combining bottom-up and top-down computation strategies.
● It leverages a data structure called a star-tree for data compression and faster
computation, which helps reduce memory usage and speeds up the process.
Key Concepts:
1. Star-Tree Structure: Represents a compressed version of cuboids, where star-nodes
(non-essential nodes) are collapsed, reducing the size of the cuboid tree.
2. Shared Dimensions: These allow for shared computation and pruning, enabling faster
processing by eliminating unnecessary calculations.
3. Top-Down and Bottom-Up Computation:
○ Bottom-Up: The algorithm starts from the base cuboid (most detailed) and
aggregates data upwards, generating higher-level cuboids.
○ Top-Down: While traversing the star-tree, the algorithm uses the top-down
model to explore shared dimensions and prune cuboids that are guaranteed not
to satisfy the iceberg condition based on previously computed results.
4. Pruning: If a node fails the iceberg condition (e.g., count < threshold), the entire
subtree rooted at that node can be pruned.
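The sketch below shows only the star-node reduction step that precedes star-tree construction: within each dimension, a value whose base-table count is already below the iceberg threshold cannot appear in any qualifying cell, so it is collapsed to a single star value. The function name and the '*' marker are illustrative assumptions.

from collections import Counter

def star_reduce(rows, min_sup):
    n_dims = len(rows[0])
    # Count how often each value occurs in each dimension of the base table.
    counts = [Counter(row[d] for row in rows) for d in range(n_dims)]
    # Replace infrequent values with '*'; the star-tree built from the reduced
    # table is much smaller because the starred rows collapse together.
    return [tuple(v if counts[d][v] >= min_sup else '*' for d, v in enumerate(row))
            for row in rows]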
High-Dimensional OLAP
● In high-dimensional OLAP, the full data cube requires massive storage and computation
time, making it impractical.
● Iceberg cubes provide a smaller alternative by only computing cells that meet a certain
threshold, but they still have limitations, such as high computation cost and the lack of
support for incremental updates.
● A better solution is the shell fragment approach, which focuses on precomputing
smaller fragments of the cube, improving computation and storage efficiency.
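To make the shell-fragment idea concrete, the sketch below precomputes, for each fragment of dimensions, an inverted index (value combination -> set of tuple IDs) for every cuboid inside that fragment; a point query spanning several fragments is then answered online by intersecting TID sets. The fragment layout and function names are illustrative assumptions.

from itertools import combinations

def build_shell_fragments(rows, fragments):
    # fragments: a partition of the dimension indices, e.g. [(0, 1, 2), (3, 4)].
    index = {}
    for frag in fragments:
        for size in range(1, len(frag) + 1):
            for cuboid in combinations(frag, size):
                inv = {}
                for tid, row in enumerate(rows):
                    key = tuple(row[d] for d in cuboid)
                    inv.setdefault(key, set()).add(tid)
                index[cuboid] = inv          # inverted index for this cuboid
    return index

def cell_count(index, cell):
    # cell: {dimension: value}. Intersect the TID sets of every stored cuboid
    # whose dimensions are all constrained by the query.
    tid_sets = [inv.get(tuple(cell[d] for d in cuboid), set())
                for cuboid, inv in index.items()
                if all(d in cell for d in cuboid)]
    return len(set.intersection(*tid_sets)) if tid_sets else 0

Only the small within-fragment cuboids are stored, so precomputation and storage grow with the fragment size rather than with the full dimensionality of the cube.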