0% found this document useful (0 votes)
67 views29 pages

Cube Implementations

The document discusses data cube implementation and OLAP systems. A data cube is represented as a lattice of cuboids, where each node is a cuboid containing aggregated measures for a dimension combination. Cuboids allow for drill down and roll up queries. Relational and multidimensional OLAP systems are discussed for storing and querying the cube lattice. Relational OLAP stores cuboids in tables while multidimensional OLAP stores them as multi-dimensional arrays. Large scale OLAP systems like Apache Kylin, Druid, and Presto are also mentioned.

Uploaded by

Parv Agarwal
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
67 views29 pages

Cube Implementations

The document discusses data cube implementation and OLAP systems. A data cube is represented as a lattice of cuboids, where each node is a cuboid containing aggregated measures for a dimension combination. Cuboids allow for drill down and roll up queries. Relational and multidimensional OLAP systems are discussed for storing and querying the cube lattice. Relational OLAP stores cuboids in tables while multidimensional OLAP stores them as multi-dimensional arrays. Large scale OLAP systems like Apache Kylin, Druid, and Presto are also mentioned.

Uploaded by

Parv Agarwal
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 29

Data Cube Implementation

pm jat @ daiict

21-09-2023 Data Cube Implementation 1


Summaries in “Data Cube”

Origin: ℱ𝑠𝑢𝑚(𝑝𝑟𝑖𝑐𝑒) 𝑓𝑎𝑐𝑡


Item Axis: 𝐼𝑡𝑒𝑚𝐼𝐷 ℱ𝑠𝑢𝑚(𝑝𝑟𝑖𝑐𝑒) (𝑓𝑎𝑐𝑡)

Store Axis: 𝑆𝑡𝑜𝑟𝑒𝐼𝐷 ℱ𝑠𝑢𝑚(𝑝𝑟𝑖𝑐𝑒) (𝑓𝑎𝑐𝑡)

Customer Axis: 𝐶𝑢𝑠𝑡𝐼𝐷 ℱ𝑠𝑢𝑚(𝑝𝑟𝑖𝑐𝑒) (𝑓𝑎𝑐𝑡)

Floor: 𝐼𝑡𝑒𝑚𝐼𝐷,𝑆𝑡𝑜𝑟𝑒𝐼𝐷 ℱ𝑠𝑢𝑚(𝑝𝑟𝑖𝑐𝑒) (𝑓𝑎𝑐𝑡)

Front Wall: 𝐼𝑡𝑒𝑚𝐼𝐷,𝐶𝑢𝑠𝑡𝐼𝐷 ℱ𝑠𝑢𝑚(𝑝𝑟𝑖𝑐𝑒) 𝑓𝑎𝑐𝑡


Right Side Wall: 𝑆𝑡𝑜𝑟𝑒𝐼𝐷,𝐶𝑢𝑠𝑡𝐼𝐷 ℱ𝑠𝑢𝑚(𝑝𝑟𝑖𝑐𝑒) (𝑓𝑎𝑐𝑡)

Rest: 𝐶𝑢𝑠𝑡,𝑆𝑡𝑜𝑟𝑒,𝐼𝑡𝑒𝑚 ℱ𝑠𝑢𝑚(𝑝𝑟𝑖𝑐𝑒) 𝑓𝑎𝑐𝑡


Data Cube is a set of “cuboids” for “each
possible subset of the given
21-09-2023 dimensions”. Correct?
Data Cube Implementation 2
“Data Cube” as
Lattice of Cuboids
GROUP BY time

GROUP BY time, item

GROUP BY time, item, location

GROUP BY time, item, location, Supplier

21-09-2023
Figure Source: Data Mining Textbook [9] Data Cube Implementation 3
Data Cube as “lattice of Cuboids”
• Data Cube is represented as a “lattice of Cuboids”, where
• Node: Cuboid
– that contains “aggregated values” (called as measures) for “a
dimension(attribute) combination”
• Edge: Parent-Child Relationship, where
– child node has one (exactly one) extra attribute (dimension) than its parent
– child node cuboid can be computed from the parent cuboid aggregations
• Data cube is a set of “cuboids” for “each possible subset of the given dimensions”.
Correct?

21-09-2023 Data Cube Implementation 4


Data Cube as “4D cuboid” 4D representation of Lattice
shown here

21-09-2023
Figure Source: Data Mining Textbook [9] Data Cube Implementation 5
Querying Lattice
• Each node is
a view.
• “dimension
hierarchies”
• Rollup, and
Drill down?

A query can be reading full


node or part of a node from the lattice

21-09-2023
Figure Source: Data Mining Textbook[9] Data Cube Implementation 6
“Cube” “location” “Data Hierarchy”
Lattices can even be maintained
for different specific attributes –
making the cube further lower
granularity

21-09-2023 Data Cube Implementation 7


Figure Source: Data Mining Textbook[9]
“Cube” “Price” “Data Hierarchy”
Lattices can even be maintained for
different specific attribute values –
making the cube further lower
granularity

Figure Source: Data Mining Textbook[9]


21-09-2023 Data Cube Implementation 8
A Simple MR approach for Lattice Computation

Questions Remain: How efficient it would be? Can the resulting cube be materialized?
If yes, where do we store it? Can we
21-09-2023 Datahave an index on dimensions?
Cube Implementation 9
A Simple Spark solution for Lattice Computation

How do we materialize, index, search?


21-09-2023 BTW, What is major limitation of Spark-SQL wrt “SQL”
Data Cube Implementation 10 ?
OLAP Systems
• Materialized “Pre-Computation” and “Pre-Aggregation” of Lattice Cuboids remain
the key of OLAP systems for response time for interactive data analysis.
• Many systems offer query optimization based on pre-computed aggregates and
automatic maintenance of stored aggregates during updating of base data.
• Materializing all combinations of aggregates may become infeasible because it takes
too much storage and initial computation time.
• Instead, modern OLAP systems adopt the practical pre-aggregation approach of
materializing only select combinations of aggregates and then reusing these to
efficiently compute other aggregates.

21-09-2023 Data Cube Implementation 11


Types of OLAP Systems
• Relational OLAP (ROLAP) – Lattice Cuboids are stored in a Relational Tables
• Multidimensional OLAP (MOLAP) [3][4] – Lattices are stored as multi-dimensional
arrays, or so?
• Hybrid OLAP (HOLAP)

21-09-2023 Data Cube Implementation 12


ROLAP: Cube Lattice as a Relation

Identify:
• Base Cuboid? what is the dimension?
• Identify 2-D, 1-D, and 0-D cuboids?

21-09-2023 Data Cube Implementation 13


Relational OLAP (ROLAP)
• “Relational OLAP (ROLAP) systems use relational database technology for storing
data, and they also employ specialized index structures, such as bit-mapped indices,
to achieve good query performance” [3]
– Here we have schemas like “Star” and “Snowflake”
– Indexes based on Dimension attributes - may span to dimension tables
• Materialized Cubes are also stored in Relational tables.
– Cube may not be stored in Full. Say only base cubes are materialized, or say up to
N-M levels are stored, and so on.

21-09-2023 Data Cube Implementation 14


Relational OLAP (ROLAP)
• Strength is “Already established and are Robust Systems”
– Large data values could be efficiently processed using robust relational systems.
• With the help of indexes, particularly “bitmap indexes”
• Still, the disadvantages are:
– could be inefficient on ad hoc queries where we do not have materialized
“Cube”.
– May not scale, unsuited for “Big Data”

21-09-2023 Data Cube Implementation 15


Multidimensional OLAP (MOLAP) systems
• Cube lattices are represented in some other structure like multi-dimensional arrays.
• “Multidimensional data cubes are stored on disks in specialized multidimensional
structures”. [3]
• “MOLAP systems typically include provisions for handling sparse arrays and apply advanced
indexing and hashing to locate the data when performing queries” [3]
• MOLAP systems generally provide more space-efficient storage as well as faster query
response times.
• Options here are
– Proprietary MOLAP systems
– Multi-dimensional databases (MDDB). Some Relational Systems also support it, for
instance, Oracle [11]
– Key value stores becoming more common in “Big Data”

21-09-2023 Data Cube Implementation 16


Multidimensional OLAP (MOLAP)
• Though Multidimensional OLAP systems are a natural choice for data cubes and all
• BUT, traditionally they have been considered unsuitable for larger data volumes
– May require high main memory

21-09-2023 Data Cube Implementation 17


[3] Pedersen, Torben Bach, and Christian S. Jensen.
21-09-2023
"Multidimensional database technology." Computer 34.12 Data Cube Implementation
(2001) 18
Multidimensional Database History

[3] Pedersen, Torben Bach, and Christian S. Jensen.


21-09-2023
"Multidimensional database technology." Computer 34.12 Data Cube Implementation
(2001) 19
OLAP Systems
• Traditionally:
– Most DBMS systems “do support”
– Special planning is required in terms of “building”, “indexing”, “updating” data
cubes
– Special extensions for extract, integrated, load or third party tools!
• Large scale OLAP systems?
– Apache Kylin on Hadoop,
– Apache Druid,
– Presto,
– Cassandra

21-09-2023 Data Cube Implementation 20


Apache Kylin
• The Kylin project was started in 2013, in eBay's R&D in China.
• In November 2014, Kylin joined the Apache Software
https://en.wikipedia.org/wiki/Apache_Kylin

• Keylin 4 is the latest stable version:

https://kylin.apache.org/docs/index.html
21-09-2023 Data Cube Implementation 21
Apache Kylin Rationale
• Kylin’s core idea is the precomputation of result sets
• It calculates all possible query results in advance according to the specified
dimensions and indicators and speed up OLAP queries with fixed query patterns. .

https://kylin.apache.org/blog/2021/07/02/Apache-Kylin4-A-new-storage-and-compute-architecture/
21-09-2023 Data Cube Implementation 22
Apache Kylin Rationale
• Kylin’s core idea is the precomputation of result sets
• It calculates all possible query results in advance according to the specified
dimensions and indicators and speed up OLAP queries with fixed query patterns. .

https://kylin.apache.org/blog/2021/07/02/Apache-Kylin4-A-new-storage-and-compute-architecture/
21-09-2023 Data Cube Implementation 23
Apache Kylin - Architecture

21-09-2023 Data Cube Implementation 24


https://kylin.apache.org/index.html
Apache Kylin - Architecture

21-09-2023 Data Cube Implementation 25


Apache Kylin

21-09-2023 Data Cube Implementation


https://kylin.apache.org/docs/index.html 26
21-09-2023 Data Cube Implementation
https://kylin.apache.org/docs/index.html 27
Related Readings
[1] Lee, Suan, et al. "Scalable distributed data cube computation for large-scale multidimensional data
analysis on a Spark cluster." Cluster Computing 22 (2019): 2063-2087.
[2] Chen, Wenhao, et al. "An optimized distributed OLAP system for big data." 2017 2nd IEEE
International Conference on Computational Intelligence and Applications (ICCIA). IEEE, 2017.
[3] Pedersen, Torben Bach, and Christian S. Jensen. "Multidimensional database technology." Computer
34.12 (2001): 40-46.
[4] Thomsen, Erik. OLAP solutions: building multidimensional information systems. John Wiley & Sons,
2002.
[5] Jiatao Tao, Apache Kylin4 — A new storage and compute architecture, Apache Kylin Technical Blog,
2021. https://kylin.apache.org/blog/2021/07/02/Apache-Kylin4-A-new-storage-and-compute-
architecture/
[6] Pedersen, Torben Bach, and Christian S. Jensen. "Multidimensional database technology." Computer
34.12 (2001): 40-46.

21-09-2023 Data Cube Implementation 28


Related Readings
[7] Yang Li, The future of Apache Kylin: More powerful and easy-to-use OLAP,
https://kylin.apache.org/blog/2022/01/12/The-Future-Of-Kylin/
[8] https://cwiki.apache.org/confluence/display/KYLIN/Global+Dictionary+on+Spark
[9] (Book) Han, Jiawei, Micheline Kamber, and Jian Pei. "Data mining concepts and techniques third
edition." University of Illinois at Urbana-Champaign Micheline Kamber Jian Pei Simon Fraser University
(2012).
[10] (Book) Thomsen, Erik. OLAP solutions: building multidimensional information systems. John Wiley &
Sons, 2002.
[11] Oracle, “Understanding Multidimensional Databases”
https://docs.oracle.com/cd/E12825_01/epm.111/esb_dbag/frameset.htm?dinconc.htm

21-09-2023 Data Cube Implementation 29

You might also like