Unit 2
Data cube computation is the process of aggregating and organizing multi-dimensional data into
a data cube, a structure commonly used in data warehousing and OLAP (Online Analytical
Processing).
Computation Process
1. Data Aggregation:
o The data cube is built by computing aggregations (e.g., sum, count, average)
across multiple dimensions.
o For example, in sales data, you might aggregate by:
Time → Year, Quarter, Month
Region → Country, State, City
Product → Category, Brand, Item
2. Cuboid Generation:
o A full data cube consists of cuboids representing all possible combinations of
dimensions.
o For n dimensions, there are 2^n cuboids, including:
Base cuboid: The raw data without aggregation.
Aggregated cuboids: Higher-level summaries.
3. OLAP Operations:
o Slice: Fixes a single value on one dimension, producing a sub-cube with one fewer dimension.
o Dice: Selects a sub-cube by specifying ranges on multiple dimensions.
o Roll-up: Aggregates data along a dimension (e.g., from month to year).
o Drill-down: Expands data along a dimension (e.g., from year to month).
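The cuboid generation step above can be illustrated with a minimal sketch. It assumes the three example dimensions (Time, Region, Product) with no concept hierarchies, so each cuboid is simply a subset of the dimensions; the function name `enumerate_cuboids` is hypothetical.

```python
from itertools import combinations

def enumerate_cuboids(dimensions):
    """Return every subset of the dimensions, from the apex () to the base cuboid."""
    cuboids = []
    for size in range(len(dimensions) + 1):
        # All ways of keeping `size` dimensions and aggregating the rest away.
        cuboids.extend(combinations(dimensions, size))
    return cuboids

cuboids = enumerate_cuboids(["Time", "Region", "Product"])
for cuboid in cuboids:
    print(cuboid if cuboid else "() grand total")
```

With 3 dimensions the list has 2^3 = 8 entries: the empty tuple is the fully aggregated cuboid, and the tuple holding all three dimensions is the base cuboid.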
2. CUBE MATERIALIZATION
Data Cube:
A data cube is a multi-dimensional array of values, typically used in OLAP (Online Analytical
Processing) systems for decision support. It represents data in multiple dimensions, allowing
efficient data retrieval and aggregation.
Cube Materialization
Cube Materialization is the process of computing and storing aggregated values for different
combinations of dimensions in a data cube. The goal is to optimize query performance by
precomputing and storing results instead of calculating them dynamically.
Full Materialization:
Definition: All possible aggregations across all dimension combinations are precomputed and
stored.
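Example: For a cube with dimensions Product, Region, and Time, full materialization would
include all 2^3 = 8 cuboids:
the base cuboid: (Product, Region, Time)
pairwise aggregations: (Product, Region), (Product, Time), (Region, Time)
single-dimension aggregations: (Product), (Region), (Time)
the grand total: ()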
Advantages:
Fast query response, because every aggregation is already precomputed.
Disadvantages:
High storage cost and maintenance overhead, especially for large datasets with many
dimensions.
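As a rough illustration of full materialization, the sketch below precomputes SUM(sales) for every cell of every cuboid. The rows, the column order, and the name `materialize_full_cube` are assumptions made for the example, and the nested loops are written for clarity rather than efficiency.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact rows: (product, region, time, sales).
ROWS = [
    ("P1", "North", "Q1", 100),
    ("P1", "South", "Q1", 150),
    ("P2", "North", "Q2", 200),
]

def materialize_full_cube(rows, num_dims):
    """Precompute SUM(sales) for every cuboid; 'ALL' marks an aggregated dimension."""
    cube = defaultdict(int)
    for size in range(num_dims + 1):
        for kept in combinations(range(num_dims), size):
            for row in rows:
                # Keep the chosen dimensions, replace the others with ALL.
                cell = tuple(row[i] if i in kept else "ALL" for i in range(num_dims))
                cube[cell] += row[num_dims]
    return dict(cube)

full_cube = materialize_full_cube(ROWS, num_dims=3)
print(full_cube[("P1", "ALL", "ALL")])   # total sales for product P1
```

Once the dictionary is built, any aggregate query becomes a single lookup, which is the speed benefit that full materialization pays for with storage.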
Partial Materialization:
Advantages:
Disadvantages:
Comparison Table
Feature            Full Materialization            Partial Materialization           No Materialization
Storage            High                            Moderate                          Low
Query Speed        Fast                            Medium                            Slow
Computation Time   Precomputed                     Mixed                             On-the-fly
Best for           Frequent, predictable queries   Balanced storage vs performance   Infrequent, unpredictable queries
Full Cube refers to the complete set of aggregations in a data cube, including all possible
combinations of dimensions and their aggregations. It represents every level of summarization
from the most detailed (base data) to the most aggregated (grand total).
In full cube materialization, the system computes and stores the results for:
An Iceberg Cube is an optimized version of a full data cube where only significant and relevant
aggregations are materialized. The idea is to prune insignificant or low-value aggregations
(which may not be useful for analysis), reducing both storage space and computation time.
In a full cube, all possible aggregations are computed, even if they are insignificant (e.g., sales
of rare items with very low revenue).
In an iceberg cube, only the aggregations that meet a specific threshold condition are stored.
For example:
In SQL, you can create an iceberg cube by using the HAVING clause to filter out low-value
aggregations:
SELECT
product, region, time, SUM(sales) AS total_sales
FROM sales_data
GROUP BY CUBE(product, region, time)
HAVING SUM(sales) > 1000;
A Closed Cube is a compressed form of a full data cube where only closed cells are materialized.
A closed cell is a cell for which no more specific (descendant) cell has the same measure
value, so it carries the most detail available for that value.
It represents the most granular or meaningful aggregation for a given group of
dimensions.
Redundant (non-closed) cells, whose values can be derived from a closed cell, are not
materialized, reducing storage costs.
(P1, North, ALL) → The total sales for P1 in North is already stored.
(P1, ALL, ALL) → The total sales for P1 is the same as the overall total.
These cells are redundant and can be derived from more detailed cells, so they aren’t
materialized in the closed cube.
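The closed-cell test can be sketched as follows. This is a brute-force illustration that assumes COUNT(*) as the measure and uses hypothetical rows; the helper names are not from the notes, and practical closed-cube algorithms avoid building the full cube first.

```python
from collections import Counter
from itertools import combinations

ALL = "ALL"

# Hypothetical fact rows: (product, region, quarter).
ROWS = [("P1", "North", "Q1"), ("P1", "North", "Q2"), ("P2", "South", "Q1")]

def cell_counts(rows):
    """COUNT(*) for every cell of the full cube."""
    counts = Counter()
    n = len(rows[0])
    for row in rows:
        for size in range(n + 1):
            for kept in combinations(range(n), size):
                counts[tuple(row[i] if i in kept else ALL for i in range(n))] += 1
    return counts

def is_descendant(cell, other):
    """True if `other` agrees with `cell` wherever `cell` is not ALL, and differs from it."""
    return other != cell and all(c == ALL or c == o for c, o in zip(cell, other))

def closed_cells(counts):
    """Keep cells that have no descendant with the same count."""
    return {
        cell: count
        for cell, count in counts.items()
        if not any(oc == count and is_descendant(cell, other) for other, oc in counts.items())
    }

counts = cell_counts(ROWS)
closed = closed_cells(counts)
```

In this data (P1, ALL, ALL) has count 2, but so does its descendant (P1, North, ALL), so only the latter is kept as closed.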
Closed Cube Construction
In a closed cube:
A Shell Cube is a partial materialization of a data cube, where only a subset of the cube’s
aggregations is precomputed and stored. The goal is to balance query performance and storage
efficiency by materializing only the most frequently queried or useful aggregations, while less
common ones are computed on-the-fly.
1. Query Efficiency:
o Frequently accessed cuboids are precomputed, making common queries faster.
2. Storage Optimization:
o Less storage space is required compared to a full cube.
3. Balanced Performance:
o Rarely accessed cuboids are computed dynamically, saving space.
4. Ideal for Large Datasets:
o Reduces the materialization burden on large datasets.
In OLAP (Online Analytical Processing), data cube computation refers to the process of
constructing and materializing a multi-dimensional cube from a large dataset. The goal is to
efficiently generate and store the required aggregations (cuboids) for fast querying and
analysis.
There are four main methods used for data cube computation: the Multiway Method, BUC
(Bottom-Up Computation), Star-Cubing, and Shell Fragmentation.
The Multiway Method is an efficient array-based data cube computation technique. It processes
the data cube by:
Key Features:
Advantages:
Limitations:
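The BUC (Bottom-Up Computation) method is a recursive approach for cube computation.
It starts from the apex cuboid (the most aggregated level) and works toward the base cuboid
(the most detailed level). This order looks top-down when the apex is drawn at the top of the
cuboid lattice; the name comes from drawing the lattice with the apex at the bottom.
It partitions the data on one dimension at a time and recurses into each partition.
It prunes partitions that fail a threshold condition, making it suitable for
iceberg cubes.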
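A simplified sketch of the BUC recursion is shown below. It assumes the iceberg condition is COUNT(*) >= min_count and uses hypothetical rows; real implementations sort or hash-partition large inputs and choose the dimension order carefully, which is omitted here.

```python
from collections import defaultdict

ALL = "ALL"

def buc(rows, num_dims, min_count, start=0, cell=None, out=None):
    """Simplified BUC: emit cells with COUNT(*) >= min_count, apex cell first."""
    if cell is None:
        cell = [ALL] * num_dims
    if out is None:
        out = {}
    if len(rows) < min_count:
        return out
    out[tuple(cell)] = len(rows)
    for d in range(start, num_dims):
        # Partition the current rows on dimension d.
        partitions = defaultdict(list)
        for row in rows:
            partitions[row[d]].append(row)
        for value, part in partitions.items():
            # Pruning: a partition below the threshold cannot contain a
            # qualifying cell, so it is never expanded further.
            if len(part) >= min_count:
                cell[d] = value
                buc(part, num_dims, min_count, d + 1, cell, out)
        cell[d] = ALL
    return out

# Hypothetical fact rows: (product, region, quarter).
ROWS = [
    ("Laptop", "North", "Q1"),
    ("Laptop", "North", "Q2"),
    ("Laptop", "South", "Q1"),
    ("Phone", "West", "Q1"),
]
iceberg = buc(ROWS, num_dims=3, min_count=2)
```

Because Phone appears only once, no cell containing Phone is ever computed, which is where the savings on sparse data come from.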
How it Works:
Key Features:
Limitations:
3. Star-Cubing Method
How it Works:
Key Features:
Advantages:
Limitations:
The Shell Fragmentation method is used in shell cubes, where only a subset of the cube is
materialized.
How it Works:
Key Features:
Advantages:
Limitations:
✅ Dense Datasets:
o Use Multiway Method for faster processing.
✅ Sparse Datasets or Iceberg Cubes:
o Use BUC for efficient pruning and storage.
✅ Mixed Density or Large Datasets:
o Use Star-Cubing for reduced redundancy and fast performance.
✅ Frequent Query Access:
o Use Shell Cube Method for optimized storage and query efficiency.
An Iceberg Cube is a pruned data cube that contains only the aggregate cells that satisfy
a specified threshold condition.
Why "Iceberg"? The cube only materializes the “tip” of the data (significant results),
while the less relevant data remains uncomputed or is dynamically generated.
Threshold condition: It can be based on:
o SUM(sales) > 5000
o COUNT(customers) > 100
o AVG(revenue) > 2000
Goal: Reduce storage space and improve query efficiency by ignoring low-value
aggregations.
Star-Tree Features:
During the cube generation, the algorithm prunes low-value cuboids using the threshold
condition.
Example: SUM(sales) > 3000
o (Laptop, North, Q1) → ✅ Included (4000 > 3000)
o (Laptop, North, Q2) → ❌ Excluded (1500 < 3000)
o (Phone, West, Q1) → ✅ Included (6000 > 3000)
o (Tablet, South, Q1) → ❌ Excluded (2000 < 3000)
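The threshold example above can be reproduced with a short sketch. The four cell totals are the ones listed in the example; the function name `iceberg_filter` is hypothetical.

```python
# Cell totals from the example: (product, region, quarter) -> SUM(sales).
cell_totals = {
    ("Laptop", "North", "Q1"): 4000,
    ("Laptop", "North", "Q2"): 1500,
    ("Phone", "West", "Q1"): 6000,
    ("Tablet", "South", "Q1"): 2000,
}

def iceberg_filter(cells, min_sum):
    """Keep only cells whose total exceeds min_sum (like HAVING SUM(sales) > min_sum)."""
    return {cell: total for cell, total in cells.items() if total > min_sum}

iceberg_cells = iceberg_filter(cell_totals, min_sum=3000)
for cell, total in iceberg_cells.items():
    print(cell, total)
```

This filters after aggregation for clarity; iceberg algorithms such as BUC apply the same condition during computation so that failing cells are never built.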
Result:
Example Query:
Dataset:
Efficiency:
Faster Queries:
What is Star-Cubing?
Star-Tree Characteristics:
Explanation:
              [*] → Root
            /    |     \
      Laptop   Tablet   Phone
      /    \      |       |
  North  North  South    West
    |      |      |       |
   Q1     Q2     Q1      Q3
    |      |      |       |
  4000   2000   3500    6000
Pruning:
        [*] → Root
        /      \
   Laptop      Phone
      |          |
    North      West
      |          |
     Q1         Q3
      |          |
    4000       6000
1. Tree Construction:
o Insert raw tuples into the Star-Tree.
o Merge nodes with common prefixes.
2. Iceberg Pruning:
o Apply the threshold condition.
o Prune low-value branches dynamically.
3. Shell Fragment Precomputation:
o Precompute frequently queried cuboids.
o Store them in memory.
4. Query Execution:
o Use precomputed shell fragments for frequent queries.
o Dynamically generate other cuboids.
5. Python Implementation of Star-Cubing
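What follows is a minimal sketch of the tree construction and iceberg pruning steps only, not a full Star-Cubing implementation (star-node compression and simultaneous aggregation of child trees are left out). The rows mirror the leaf values in the tree diagram; the class name, the function names, and the threshold of 3500 are assumptions, with the threshold chosen so the result matches the pruned tree.

```python
class StarTreeNode:
    """One node of a simplified star-tree: a dimension value plus a running total."""
    def __init__(self, value):
        self.value = value
        self.total = 0
        self.children = {}

def build_star_tree(rows):
    """Insert (dimension values..., measure) tuples; common prefixes share nodes."""
    root = StarTreeNode("*")
    for *dims, measure in rows:
        node = root
        node.total += measure
        for value in dims:
            if value not in node.children:
                node.children[value] = StarTreeNode(value)
            node = node.children[value]
            node.total += measure
    return root

def prune(node, min_sum):
    """Drop branches whose total is not above min_sum (valid for non-negative measures)."""
    node.children = {v: c for v, c in node.children.items() if c.total > min_sum}
    for child in node.children.values():
        prune(child, min_sum)

def paths(node, prefix=()):
    """List the remaining root-to-leaf paths with their totals."""
    if not node.children:
        return [(prefix, node.total)]
    result = []
    for child in node.children.values():
        result.extend(paths(child, prefix + (child.value,)))
    return result

ROWS = [
    ("Laptop", "North", "Q1", 4000),
    ("Laptop", "North", "Q2", 2000),
    ("Tablet", "South", "Q1", 3500),
    ("Phone", "West", "Q3", 6000),
]
tree = build_star_tree(ROWS)
prune(tree, min_sum=3500)   # assumed threshold
surviving = paths(tree)
```

The two Laptop rows share the Laptop and North nodes, so their totals accumulate on the shared prefix before any pruning decision is made.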
Dataset:
Efficiency:
Reduced Storage:
Faster Queries:
In OLAP (Online Analytical Processing), advanced cube techniques are used for:
How It Works:
1. Random sampling:
o A random subset of tuples is used to build the cube.
2. Stratified sampling:
o The dataset is divided into strata based on dimension values, and a sample is
drawn from each stratum.
3. Clustered sampling:
o Data is grouped into clusters, and a random sample of clusters is used.
Example:
Full Cube:
o (Product, Region, Time) → Aggregates across the entire dataset.
Sampling Cube:
o Only uses 10% of the dataset → Faster query execution with approximate
results.
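The random-sampling variant can be sketched as below. It assumes simple random sampling without replacement and estimates SUM(sales) per region by scaling each sampled row; the data, the fixed seed, and the name `sampling_cube` are hypothetical.

```python
import random
from collections import defaultdict

def sampling_cube(rows, fraction, seed=0):
    """Estimate SUM(sales) per region from a simple random sample of the rows."""
    rng = random.Random(seed)            # fixed seed so the sketch is repeatable
    sample_size = max(1, int(len(rows) * fraction))
    sample = rng.sample(rows, sample_size)
    scale = len(rows) / sample_size      # each sampled row stands for `scale` rows
    estimates = defaultdict(float)
    for region, sales in sample:
        estimates[region] += sales * scale
    return dict(estimates)

# Hypothetical data: 1,000 rows of (region, sales).
rows = [("North" if i % 2 == 0 else "South", 10) for i in range(1000)]
estimates = sampling_cube(rows, fraction=0.10)
estimated_total = sum(estimates.values())
print(estimates, estimated_total)
```

Only 100 of the 1,000 rows are read, so the per-region figures are approximations whose accuracy depends on the sample; stratified sampling by region would reduce that variation.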
Use Cases:
How It Works:
1. Cube construction:
o The cube stores only the top-K tuples per cuboid.
2. Query execution:
o Queries retrieve only the ranked results.
3. Indexing:
o Efficiently stores ranked results using multi-level indexing.
Example:
Full Cube:
o (Product, Region, Time) → Aggregates across all dimensions.
Ranking Cube:
o (Product, Region) → Top-5 products by sales in each region.
SQL Example:
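SELECT product, region, total_sales
FROM (
    SELECT product, region, SUM(sales) AS total_sales,
           RANK() OVER (PARTITION BY region ORDER BY SUM(sales) DESC) AS sales_rank
    FROM sales_data
    GROUP BY product, region
) AS ranked
WHERE sales_rank <= 5; -- Top-5 products in each region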
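The cube construction step (storing only the top-K entries per cell) can be sketched in a few lines. The rows, the choice of K = 2, and the name `build_ranking_cube` are assumptions for illustration; multi-level indexing is not shown.

```python
import heapq
from collections import defaultdict

def build_ranking_cube(rows, k):
    """For each region, keep only the k products with the highest total sales."""
    totals = defaultdict(int)
    for product, region, sales in rows:
        totals[(region, product)] += sales
    by_region = defaultdict(list)
    for (region, product), total in totals.items():
        by_region[region].append((total, product))
    # nlargest returns the entries sorted from highest to lowest total.
    return {region: heapq.nlargest(k, items) for region, items in by_region.items()}

# Hypothetical fact rows: (product, region, sales).
ROWS = [
    ("Laptop", "North", 4000),
    ("Phone", "North", 6000),
    ("Tablet", "North", 2000),
    ("Laptop", "North", 1000),
    ("Phone", "West", 3000),
    ("Tablet", "West", 3500),
]
ranking_cube = build_ranking_cube(ROWS, k=2)
```

A top-K query for a region is then a direct lookup such as `ranking_cube["North"]`, with no sorting at query time.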
Use Cases:
Cube Space Exploration involves analyzing data cubes across multiple dimensions.
It supports:
o Slicing and dicing data across dimensions.
o Drill-down and roll-up analysis.
o Dimension reduction and aggregation for efficient exploration.
1. Slicing:
o Fixes a single value on one dimension and analyzes the remaining dimensions.
o Example: Sales by region for Q1 only.
2. Dicing:
o Selects a subcube from multiple dimensions.
o Example: Sales of Laptops in North region for Q1 and Q2.
3. Roll-up:
o Aggregates data by climbing the dimension hierarchy.
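The three operations above can be sketched on a tiny in-memory fact table. The rows, the quarter-to-half-year hierarchy, and the function names are hypothetical and only meant to show what each operation selects or aggregates.

```python
from collections import defaultdict

# Hypothetical fact rows: (product, region, quarter, sales).
ROWS = [
    ("Laptop", "North", "Q1", 4000),
    ("Laptop", "North", "Q2", 1500),
    ("Phone", "West", "Q1", 6000),
    ("Tablet", "South", "Q1", 2000),
]
QUARTER_TO_HALF = {"Q1": "H1", "Q2": "H1", "Q3": "H2", "Q4": "H2"}

def slice_quarter(rows, quarter):
    """Slice: fix one value on the time dimension."""
    return [r for r in rows if r[2] == quarter]

def dice(rows, products, regions, quarters):
    """Dice: keep rows whose values fall in the chosen sets on several dimensions."""
    return [r for r in rows if r[0] in products and r[1] in regions and r[2] in quarters]

def sales_by_region(rows):
    """Aggregate sales per region for the given rows."""
    totals = defaultdict(int)
    for _product, region, _quarter, sales in rows:
        totals[region] += sales
    return dict(totals)

def roll_up_to_half_year(rows):
    """Roll-up: climb the time hierarchy from quarter to half-year."""
    totals = defaultdict(int)
    for product, region, quarter, sales in rows:
        totals[(product, region, QUARTER_TO_HALF[quarter])] += sales
    return dict(totals)

q1_by_region = sales_by_region(slice_quarter(ROWS, "Q1"))
laptop_north = dice(ROWS, {"Laptop"}, {"North"}, {"Q1", "Q2"})
half_year = roll_up_to_half_year(ROWS)
```

Here `q1_by_region` corresponds to "Sales by region for Q1 only", and `laptop_north` to "Sales of Laptops in North region for Q1 and Q2".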
Use Cases:
How It Works:
1. Training phase:
o Historical data is used to train the prediction model.
2. Cube generation:
o The prediction model is applied to generate the cube with predicted values.
3. Querying phase:
o Users can query both historical and predicted values.
Example:
Historical Cube:
o (Product, Region, Time) → Sales data from the past 5 years.
Prediction Cube:
o Forecasts next year’s sales based on the historical data.
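As a rough sketch of the three phases, the code below fits a straight-line trend per (product, region) cell and stores the forecast for a target year. The linear model, the yearly data, and the function names are assumptions chosen for illustration; any prediction model could take its place, and each cell needs at least two distinct years for the fit to be defined.

```python
from collections import defaultdict

def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx                     # requires at least two distinct x values
    return slope, mean_y - slope * mean_x

def build_prediction_cube(rows, target_year):
    """Training: group history per cell. Cube generation: store each cell's forecast."""
    history = defaultdict(list)
    for product, region, year, sales in rows:
        history[(product, region)].append((year, sales))
    cube = {}
    for cell, points in history.items():
        slope, intercept = fit_line(points)
        cube[cell] = slope * target_year + intercept
    return cube

# Hypothetical history: five years of sales for one cell.
ROWS = [("Laptop", "North", 2019 + i, 100 + 10 * i) for i in range(5)]
prediction_cube = build_prediction_cube(ROWS, target_year=2024)
print(prediction_cube[("Laptop", "North")])
```

The querying phase is then a lookup in `prediction_cube`, alongside whatever structure holds the historical aggregates.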
Use Cases:
-- PREDICT is an illustrative placeholder here, not a standard SQL function.
SELECT product, region, PREDICT(sales) AS predicted_sales
FROM sales_data
WHERE time = 'Next Quarter';
6. Multi-Feature Cubes
A Multi-Feature Cube stores multiple measures and aggregations for each cuboid.
Combines statistical and descriptive metrics into a single cube.
Supports simultaneous analysis of multiple features.
How It Works:
1. Cube construction:
o Aggregates multiple features into each cuboid.
2. Query execution:
o Users can query multiple features simultaneously.
Example:
Full Cube:
o (Product, Region, Time) → Stores SUM(sales), AVG(profit), COUNT(orders).
Multi-Feature Cube:
o Supports querying all three features simultaneously.
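A minimal sketch of storing several measures per cell is given below. It assumes one row per order with hypothetical sales and profit columns, and aggregates only the (product, region) cuboid to keep the example short.

```python
from collections import defaultdict

def build_multi_feature_cells(rows):
    """Store SUM(sales), AVG(profit), and COUNT(orders) for each (product, region) cell."""
    acc = defaultdict(lambda: {"sales": 0, "profit": 0, "orders": 0})
    for product, region, sales, profit in rows:
        cell = acc[(product, region)]
        cell["sales"] += sales
        cell["profit"] += profit
        cell["orders"] += 1               # one row is treated as one order
    return {
        key: {
            "total_sales": c["sales"],
            "avg_profit": c["profit"] / c["orders"],
            "order_count": c["orders"],
        }
        for key, c in acc.items()
    }

# Hypothetical order rows: (product, region, sales, profit).
ROWS = [
    ("Laptop", "North", 4000, 400),
    ("Laptop", "North", 1500, 100),
    ("Phone", "West", 6000, 900),
]
cells = build_multi_feature_cells(ROWS)
print(cells[("Laptop", "North")])
```

Keeping the running sum and count for profit, rather than the average itself, is what lets the average be combined correctly when cells are rolled up further.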
Use Cases:
SQL Example:
SELECT product, region, time,
    SUM(sales) AS total_sales,
    AVG(profit) AS avg_profit,
    COUNT(orders) AS order_count
FROM sales_data
GROUP BY CUBE(product, region, time);
Example:
Retail sales:
o Detects unexpected sales spikes in specific regions.
Fraud detection:
o Identifies anomalous transactions in financial data.