
Apriori Algorithm Overview

The Apriori Algorithm uses prior knowledge of itemset properties to mine frequent itemsets for
Boolean association rules. It leverages the Apriori property, which states:

All nonempty subsets of a frequent itemset must also be frequent.

This property significantly reduces the search space during candidate generation: if any (k−1)-subset of a candidate k-itemset is not frequent, the candidate itself cannot be frequent and is discarded without counting its support.

Detailed Algorithm
1. First Iteration:
○ Compute frequent 1-itemsets (L1) by counting the occurrences of each item in the transactions.
○ Keep items satisfying the minimum support threshold.
2. Subsequent Iterations:
○ Join: Generate candidate k-itemsets (Ck) by joining Lk−1 with itself. This involves merging (k−1)-itemsets sharing the first k−2 items.
○ Prune: Eliminate candidates in Ck whose subsets are not frequent.
3. Database Scan:
○ Count occurrences of candidate k-itemsets from Ck.
○ Retain only those meeting the minimum support in Lk.
4. Termination:
○ Repeat until no frequent itemsets (Lk) are generated.

The Apriori Algorithm (Pseudo-Code)

Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
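
For concreteness, the following is a minimal runnable Python sketch of the same loop. The function name apriori, the list-of-sets transaction format, and the absolute min_support count are illustrative assumptions, not part of the original pseudocode.

from itertools import combinations

def apriori(transactions, min_support):
    """Minimal Apriori sketch: transactions is a list of sets, min_support an absolute count."""
    # First iteration: frequent 1-itemsets (L1)
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Lk = {s for s, c in counts.items() if c >= min_support}
    frequent = set(Lk)
    k = 1
    while Lk:
        # Join: merge frequent k-itemsets into (k+1)-candidates, and prune
        # candidates that have an infrequent k-subset (Apriori property)
        Ck = set()
        for a in Lk:
            for b in Lk:
                cand = a | b
                if len(cand) == k + 1 and all(frozenset(s) in Lk for s in combinations(cand, k)):
                    Ck.add(cand)
        # Database scan: count candidates contained in each transaction
        cand_counts = dict.fromkeys(Ck, 0)
        for t in transactions:
            for c in Ck:
                if c <= t:
                    cand_counts[c] += 1
        Lk = {c for c, n in cand_counts.items() if n >= min_support}
        frequent |= Lk
        k += 1
    return frequent

txns = [{"bread", "milk"}, {"bread", "beer", "eggs"}, {"milk", "beer", "bread"}, {"bread", "milk", "beer"}]
print(apriori(txns, min_support=2))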
Improving the Efficiency of Apriori

1. Direct Hashing and Pruning (DHP): Reduce the Number of Candidates


Direct Hashing and Pruning (DHP) is a technique used to reduce the number of candidate
itemsets generated during frequent itemset mining. By leveraging hashing and pruning
strategies, DHP optimizes the computational process. Here's how it works:

Key Concepts

1. Hashing Candidate Itemsets:


○ During the generation of candidate itemsets, a hash table is used to store counts
of itemsets.
○ Each k-itemset is hashed into a bucket using a predefined hash function.
○ The bucket count reflects how many transactions contain the k-itemsets mapped
to that bucket.
2. Pruning with Bucket Counts:
○ Buckets with counts below the support threshold can be pruned, as no k-
itemset within those buckets can be frequent.
○ This pruning significantly reduces the number of candidate itemsets to be
considered in subsequent iterations.
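
As a rough illustration of the hashing step for 2-itemsets: the bucket count, the modulo hash, and the function name dhp_pair_filter below are illustrative choices for this sketch, not prescribed by DHP itself.

from itertools import combinations

def dhp_pair_filter(transactions, min_support, num_buckets=7):
    """Hash every 2-itemset occurring in the data into buckets; a pair can be
    frequent only if its bucket count reaches min_support, so low-count buckets
    prune candidates before they are ever generated."""
    buckets = [0] * num_buckets
    for t in transactions:
        for pair in combinations(sorted(t), 2):
            buckets[hash(pair) % num_buckets] += 1
    def may_be_frequent(pair):
        return buckets[hash(tuple(sorted(pair))) % num_buckets] >= min_support
    return may_be_frequent

txns = [{"bread", "milk"}, {"bread", "beer"}, {"milk", "beer", "bread"}]
check = dhp_pair_filter(txns, min_support=2)
print(check(("bread", "milk")))  # True: the pair's bucket count is at least its true support of 2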

2. Transaction Reduction
● Purpose: Minimize the number of transactions scanned in subsequent iterations.
● Method:
○ Transactions without frequent k-itemsets are ignored in future scans for k+1
itemsets.
● Benefit: Reduces the computational overhead in future database scans.
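
A minimal sketch of this idea, assuming transactions are Python sets and frequent_k is the set of frequent k-itemsets (frozensets) found in the current pass:

def reduce_transactions(transactions, frequent_k):
    """Keep only transactions containing at least one frequent k-itemset;
    the others cannot contribute to any frequent (k+1)-itemset later on."""
    return [t for t in transactions if any(itemset <= t for itemset in frequent_k)]

txns = [{"bread", "milk"}, {"eggs"}, {"bread", "beer"}]
L1 = {frozenset(["bread"]), frozenset(["milk"])}
print(reduce_transactions(txns, L1))  # the {"eggs"} transaction is dropped from future scans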

3. Partitioning
● Purpose: Find frequent itemsets efficiently using just two database scans.
● Method:
○ Phase I: Divide the database into non-overlapping partitions and find local
frequent itemsets in each.
○ Phase II: Merge local frequent itemsets to form global candidates and verify their
support in the database.
● Benefit: Ensures efficient use of memory as partitions are small enough to fit into main
memory.
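
A minimal sketch of the two-phase idea, assuming a find_local_frequent helper (for example, the apriori function sketched earlier) that returns frozensets, and a relative minimum support:

def partitioned_frequent_itemsets(transactions, min_support_ratio, n_partitions, find_local_frequent):
    """Phase I: mine each partition locally; Phase II: verify the union of local
    results against the full database in one scan. Any globally frequent itemset
    must be locally frequent in at least one partition, so no answers are missed."""
    size = max(1, len(transactions) // n_partitions)
    candidates = set()
    # Phase I: local frequent itemsets per partition
    for i in range(0, len(transactions), size):
        part = transactions[i:i + size]
        local_min = max(1, int(min_support_ratio * len(part)))
        candidates |= find_local_frequent(part, local_min)
    # Phase II: one full scan to verify global support
    global_min = min_support_ratio * len(transactions)
    return {c for c in candidates
            if sum(1 for t in transactions if c <= t) >= global_min}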

4. Sampling
● Purpose: Improve efficiency by analyzing a subset of the database.
● Method:
○ Randomly sample a portion of the database.
○ Identify frequent itemsets in the sample using a lower support threshold.
○ Validate itemsets against the entire database.
● Benefit: Reduces the computational cost of processing the entire dataset at once, with a
tradeoff in accuracy.
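
A minimal sketch of the sampling strategy, assuming a find_frequent helper such as the apriori sketch above; the 10% sample rate and the factor used to lower the threshold are illustrative:

import random

def sampled_frequent_itemsets(transactions, min_support_ratio, find_frequent, sample_ratio=0.1):
    """Mine a random sample with a lowered threshold, then verify the resulting
    candidates against the full database in a single scan."""
    sample = random.sample(transactions, max(1, int(sample_ratio * len(transactions))))
    # Lower the threshold on the sample to reduce the chance of missing frequent itemsets
    lowered = max(1, int(0.8 * min_support_ratio * len(sample)))
    candidates = find_frequent(sample, lowered)
    global_min = min_support_ratio * len(transactions)
    return {c for c in candidates
            if sum(1 for t in transactions if c <= t) >= global_min}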

5. Dynamic Itemset Counting (DIC)


● Purpose: Reduce the number of database scans by dynamically adding candidate
itemsets during a scan.
● Method:
○ Divide the database into blocks with marked start points.
○ Add new candidate itemsets whenever the count-so-far for an itemset exceeds
the minimum support threshold.
● Benefit: Leads to fewer database scans compared to the static candidate generation of
Apriori.
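
A highly simplified sketch of the block-boundary idea, restricted to 1- and 2-itemsets; it shows how pair counters are started dynamically mid-scan, but omits the wrap-around completion pass that full DIC performs for late-started counters:

from itertools import combinations

def dic_two_itemsets(transactions, min_support, block_size=100):
    """Start counting a 2-itemset at the next block boundary as soon as both of
    its items have reached min_support in the counts gathered so far, instead of
    waiting for a full extra pass as Apriori would."""
    item_counts, pair_counts = {}, {}
    for start in range(0, len(transactions), block_size):
        block = transactions[start:start + block_size]
        for t in block:
            for item in t:
                item_counts[item] = item_counts.get(item, 0) + 1
            for pair in combinations(sorted(t), 2):
                if pair in pair_counts:
                    pair_counts[pair] += 1
        # Block boundary: add counters for pairs whose items already look frequent
        hot = {i for i, c in item_counts.items() if c >= min_support}
        for pair in combinations(sorted(hot), 2):
            pair_counts.setdefault(pair, 0)
    # NOTE: full DIC keeps counting late-started pairs through a wrap-around pass
    # so their counts cover the whole database; that completion step is omitted here.
    return {p for p, c in pair_counts.items() if c >= min_support}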

Data cube computation methods:


Multiway Array Aggregation

● The Multiway Array Aggregation for Full Cube Computation method is used in MOLAP
systems.
● This technique divides a multidimensional data array into smaller chunks to fit within
memory and improves memory access by optimizing the order in which cells are visited.
● The aggregation is done simultaneously across multiple dimensions to reduce
unnecessary revisits and computational costs.
● It employs compression techniques to manage sparse arrays and is effective when
memory space is sufficient, but becomes less feasible with high-dimensional or sparse
data sets.

The key aspects of the approach include:

1. Chunking: The data array is partitioned into smaller, memory-sized chunks, which are
then processed.
2. Simultaneous Aggregation: The aggregation is done on multiple dimensions
simultaneously during a chunk scan.
3. Efficient Ordering: The chunks are processed in an optimal order to reduce memory
usage and computational overhead.

● Method: the planes are sorted and computed in ascending order of their size.
● Idea: keep the smallest plane in main memory, and fetch and compute only one chunk at a time for the largest plane.
● This method works well for moderate-dimensional arrays that aren't highly sparse, providing faster performance than traditional ROLAP systems.
● Limitation of the method: it computes efficiently only for a small number of dimensions.
● If there are a large number of dimensions, "top-down" computation and iceberg cube computation methods can be explored.
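
A minimal sketch of chunked, simultaneous aggregation for a small dense 3-D array, assuming NumPy; real MOLAP chunking also handles compression of sparse chunks and the chunk-ordering optimization described above, which this omits:

import numpy as np

def multiway_aggregate(cube, chunk=2):
    """Scan a 3-D array (A, B, C) chunk by chunk and update the AB, AC and BC
    group-by planes simultaneously, so each chunk is read only once."""
    A, B, C = cube.shape
    ab = np.zeros((A, B)); ac = np.zeros((A, C)); bc = np.zeros((B, C))
    for a in range(0, A, chunk):
        for b in range(0, B, chunk):
            for c in range(0, C, chunk):
                block = cube[a:a+chunk, b:b+chunk, c:c+chunk]
                # Aggregate the same chunk into all three planes at once
                ab[a:a+chunk, b:b+chunk] += block.sum(axis=2)
                ac[a:a+chunk, c:c+chunk] += block.sum(axis=1)
                bc[b:b+chunk, c:c+chunk] += block.sum(axis=0)
    return ab, ac, bc

cube = np.arange(4 * 4 * 4).reshape(4, 4, 4)
ab, ac, bc = multiway_aggregate(cube)
assert np.allclose(ab, cube.sum(axis=2))  # same result as aggregating in one pass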

BUC:

● The BUC (Bottom-Up Construction) algorithm is used for computing sparse and
iceberg cubes in data warehousing.
● It constructs the cube from the apex cuboid (most aggregated level) down to the base
cuboid (least aggregated level).
● BUC uses partitioning and pruning techniques to efficiently compute the cube by
recursively aggregating data.

Key points:

● Top-Down Exploration: The algorithm starts at the apex cuboid and works towards the
base cuboid, processing the data by partitioning it at each dimension.
● Apriori Property: It prunes partitions that do not meet the minimum support threshold, avoiding unnecessary calculations.
● Partitioning: At each level, the data is partitioned based on the dimension values, and
the algorithm recurses to the next level if the partition meets the support condition.
● Optimization: If a partition contains only one tuple, the result is directly written to avoid
unnecessary partitioning costs.
● Counting Sort: a linear-time sorting technique is used to speed up partitioning.

The BUC method helps in efficient iceberg cube computation by reducing the amount of work
done through pruning and partitioning, especially when dealing with large datasets.
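
A minimal sketch of BUC's recursive partition-and-prune structure over a list of tuples, assuming min_support is an absolute count; the counting-sort and single-tuple optimizations mentioned above are omitted:

def buc(tuples, min_support, start_dim=0, prefix=(), results=None):
    """Recursively partition the data on each remaining dimension; a partition is
    expanded further only if it meets min_support (Apriori-style pruning)."""
    if results is None:
        results = {}
    results[prefix] = len(tuples)              # aggregate (count) for the current group-by
    n_dims = len(tuples[0]) if tuples else 0
    for d in range(start_dim, n_dims):
        partitions = {}                        # partition on dimension d
        for t in tuples:
            partitions.setdefault(t[d], []).append(t)
        for value, part in partitions.items():
            if len(part) >= min_support:       # prune partitions below the iceberg threshold
                buc(part, min_support, d + 1, prefix + ((d, value),), results)
    return results

data = [("store1", "jan", "tv"), ("store1", "jan", "radio"), ("store2", "feb", "tv")]
print(buc(data, min_support=2))  # e.g. {(): 3, ((0, 'store1'),): 2, ((0, 'store1'), (1, 'jan')): 2, ...}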

Star-Cubing algorithm
● The Star-Cubing algorithm is designed to compute iceberg cubes efficiently by
combining bottom-up and top-down computation strategies.
● It leverages a data structure called a star-tree for data compression and faster
computation, which helps reduce memory usage and speeds up the process.

Key Concepts:
1. Star-Tree Structure: Represents a compressed version of cuboids, where star-nodes
(non-essential nodes) are collapsed, reducing the size of the cuboid tree.
2. Shared Dimensions: These allow for shared computation and pruning, enabling faster
processing by eliminating unnecessary calculations.
3. Top-Down and Bottom-Up Computation:
○ Bottom-Up: The algorithm starts from the base cuboid (most detailed) and
aggregates data upwards, generating higher-level cuboids.
○ Top-Down: While traversing the star-tree, the algorithm uses the top-down
model to explore shared dimensions and prune cuboids that are guaranteed not
to satisfy the iceberg condition based on previously computed results.
4. Pruning: If a dimension value fails the iceberg condition (e.g., count < threshold), the entire subtree rooted at that node can be pruned.

Steps of Star-Cubing Algorithm:


1. Star-Table Construction: The algorithm starts by scanning the input data twice to build
the star-table and star-tree.
2. Bottom-Up Aggregation: For each cuboid in the star-tree, the algorithm aggregates
values bottom-up and prunes unnecessary nodes.
3. Top-Down Expansion: During the backtracking phase, shared dimensions are
expanded and additional pruning is performed.
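
Full Star-Cubing is involved; the sketch below only illustrates the star-table step, in which dimension values that cannot satisfy the iceberg condition are collapsed into a star ("*") node, which is what makes the resulting star-tree smaller. The tuple format and threshold are illustrative.

def build_star_table(tuples, min_support):
    """Replace every dimension value whose count is below min_support with '*',
    since such values cannot appear in any cell satisfying the iceberg condition."""
    n_dims = len(tuples[0])
    # Count each value per dimension
    counts = [{} for _ in range(n_dims)]
    for t in tuples:
        for d, v in enumerate(t):
            counts[d][v] = counts[d].get(v, 0) + 1
    # Collapse infrequent values into star-nodes
    return [tuple(v if counts[d][v] >= min_support else "*" for d, v in enumerate(t))
            for t in tuples]

data = [("a1", "b1", "c1"), ("a1", "b2", "c2"), ("a2", "b1", "c1")]
print(build_star_table(data, min_support=2))
# [('a1', 'b1', 'c1'), ('a1', '*', '*'), ('*', 'b1', 'c1')]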

High-Dimensional OLAP
● In high-dimensional OLAP, the full data cube requires massive storage and computation
time, making it impractical.
● Iceberg cubes provide a smaller alternative by only computing cells that meet a certain
threshold, but they still have limitations like high cost and lack of incremental updates.
● A better solution is the shell fragment approach, which focuses on precomputing
smaller fragments of the cube, improving computation and storage efficiency.

Shell Fragment Approach:


The key idea is that OLAP queries typically focus on a small subset of dimensions at a time.
This approach precomputes cube fragments for smaller subsets of dimensions, making OLAP
queries more efficient without computing the entire high-dimensional cube.

1. Partitioning: Dimensions are grouped into fragments. For example, in a 60-dimensional cube, you might group the dimensions into 20 fragments of 3 dimensions each.
2. Inverted Index: For each fragment, an inverted index is created, listing the tuples that
match each attribute value. This enables quick retrieval of relevant data.
3. Cube Shell Fragments: Instead of computing all cuboids, the shell fragments are
precomputed. These fragments represent intersections of subsets of the dimensions.
For each fragment, cuboids are computed by intersecting TID lists (lists of tuple
identifiers) corresponding to each dimension value.
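
A minimal sketch of the inverted-index / TID-intersection idea for one fragment, assuming tuples are rows with positional dimension values; the grouping of dimensions into fragments is omitted:

def build_inverted_index(rows, dim):
    """Map each attribute value of a dimension to the set of tuple IDs (TIDs) containing it."""
    index = {}
    for tid, row in enumerate(rows):
        index.setdefault(row[dim], set()).add(tid)
    return index

rows = [("a1", "b1"), ("a1", "b2"), ("a2", "b1"), ("a1", "b1")]
idx_a = build_inverted_index(rows, 0)   # {'a1': {0, 1, 3}, 'a2': {2}}
idx_b = build_inverted_index(rows, 1)   # {'b1': {0, 2, 3}, 'b2': {1}}

# A cell of the (A, B) cuboid is computed by intersecting the two TID lists
cell_a1_b1 = idx_a["a1"] & idx_b["b1"]
print(len(cell_a1_b1))                  # 2 tuples match (a1, b1)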
