03 - A Survey On OLAP
K. Dhanasree
Dept of CSE, DRKIST
Hyderabad, Telangana, India

C. Shobabindu
Dept of CSE, JNTUA College of Engineering
Anantapuramu, Andhra Pradesh, India
Abstract--Online analytical processing (OLAP) is today's major database technology and has completely changed the face of decision support systems. Many enterprise real-time analytical solutions are built on advanced OLAP methods. In this paper, we present an overview of the various OLAP technologies and their access paths. The focus of this paper is on OLAP in a distributed scenario, where we pinpoint the drawbacks of OLAP's natural indexing search. We designed a new translated lattice, called the pchrome lattice, whose nodes are binary. We implemented natural indexing on this translated lattice and showed a drastic reduction in indexing search space, search time and distributed communication cost.

Keywords- MOLAP, ROLAP, HOLAP, B-tree, Bitmap, R-trees, R*-trees, R-cube.

I. INTRODUCTION

In the past decades, various database technologies have been used to answer user queries, both simple and complex. Database technology is used most prominently in business enterprises, where decision making takes priority over transactions. Traditional database systems are transaction processing systems, which access only a few tuples for database reads and writes [1]. Their major drawback is that they cannot handle users' decision-making queries. This is because decision making requires an immediate comparison of past and present data, and traditional databases do not store any past data. To handle enormous volumes of past and present data and to support decision-making queries, many enterprises use an extended database technology called the data warehouse. Data warehouses differ considerably from traditional database applications. They are mainly used by major business enterprises to analyze their business trends and to track their profits. Analysts use the data warehouse to extract the business information that enables better decision making. This type of interactive decision-making process is provided by OLAP (On-line Analytical Processing) tools [2]. OLAP applications mostly use only data reads for their decision making. Real-time complex analytical queries are answered using OLAP.

The most commonly used OLAP technologies are Multidimensional On-line Analytical Processing (MOLAP), Relational On-line Analytical Processing (ROLAP) and Hybrid On-line Analytical Processing (HOLAP) [3]. They differ in their data processing capabilities, and each has its own supporting data access methods. Though they are opposing technologies, they are widely recognized by many of today's decision-making enterprises.

The remaining parts of the paper are organized as follows: In section 2 we briefly discuss the classification of OLAP technologies. In section 3 we discuss the data accessing methods. In section 4 we discuss when and where to use these technologies. In section 5 we discuss OLAP in a distributed scenario. Finally, section 6 concludes the paper.

II. OLAP TECHNOLOGIES

An organization's huge volume of data is a critical resource that needs powerful tools to fetch queried information. OLAP is one such powerful technology, providing sophisticated tools for an enterprise to meet its competitive goals. Currently there are three dominant OLAP technologies:
• Multidimensional OLAP (MOLAP).
• Relational OLAP (ROLAP).
• Hybrid OLAP (HOLAP).

A. MOLAP

In MOLAP the preprocessed data is aggregated and uploaded periodically into a multidimensional array structure called the data cube [4]. Based on the dimensional hierarchies, the data cube is divided into sub-cubes. For a data cube with n dimensions and no hierarchies there can be a total of 2^n sub-cubes. With hierarchies defined, the number of sub-cubes increases. As the dimensions and dimensional hierarchies increase, the cube becomes larger, with many sub-cubes. As a result, a MOLAP query for a user-requested sub-cube has to spend time on on-the-fly analysis. To make this on-the-fly analysis faster, MOLAP relies on pre-computation. Pre-computation is a generic means of achieving short response times, in which some of the sub-cubes are materialized [5]. Materialization means that some of the needed measures, such as sum and average, are calculated beforehand and the values are stored in the sub-cubes. In MOLAP all these measures are stored in arrays, referenced by dimension names that are strings. A MOLAP cube sits between the warehouse and the user front-end tools, analyzing the user-requested data. For a MOLAP cube with huge dimensional hierarchies, many of the smaller granules of the cube are left empty after pre-computation. This is the curse of dimensionality ("dimensional cursity") [6] of the data cube, in which many sparse sub-cubes are generated. The main problem with sparsity is that many OLAP methods search through the sparse cube to determine whether the user-requested sub-cube is materialized or not. This may increase the query waiting time. Research has produced many methods for deciding which sub-cubes to materialize [7]. To our knowledge there is less work done on
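The sub-cube count and the idea of materialization described in the MOLAP discussion above can be illustrated with a minimal sketch. It assumes a toy fact table with hypothetical dimensions A, B, C and a hypothetical measure named sales; the function name materialize_cube is ours and does not come from any OLAP product. It enumerates every group-by of the dimensions and stores the pre-computed sums under string dimension names, which is the keying scheme the text attributes to MOLAP.

```python
from itertools import combinations
from collections import defaultdict

def materialize_cube(rows, dimensions, measure):
    """Pre-compute SUM(measure) for every group-by of `dimensions`.

    The result is keyed by the group-by itself (a tuple of dimension-name
    strings); each value maps a tuple of dimension values to the summed measure.
    """
    cube = {}
    for k in range(len(dimensions) + 1):
        for group_by in combinations(dimensions, k):
            sums = defaultdict(int)
            for row in rows:
                key = tuple(row[d] for d in group_by)
                sums[key] += row[measure]
            cube[group_by] = dict(sums)
    return cube

# Hypothetical toy fact table with n = 3 dimensions and one measure.
rows = [
    {"A": "a1", "B": "b1", "C": "c1", "sales": 10},
    {"A": "a1", "B": "b2", "C": "c1", "sales": 5},
    {"A": "a2", "B": "b1", "C": "c2", "sales": 7},
]
cube = materialize_cube(rows, ["A", "B", "C"], "sales")

print(len(cube))      # number of sub-cubes, 2**3 for three dimensions
print(cube[("A",)])   # sums per value of dimension A
print(cube[()])       # the empty group-by, i.e. the overall total
```

This sketch materializes the full cube; the methods surveyed in the paper differ precisely in which of these sub-cubes they choose to compute and store.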
2016 IEEE International Conference on Computational Intelligence and Computing Research

There are many approaches to performing pre-computation:

1. Multi-way

Multi-way array aggregation, discussed in [13], pre-computes the aggregates using the array as its basic structure and is a full cube computation method. It uses the concept of chunks, where the entire cube memory is partitioned into chunks. These chunks are then aggregated simultaneously across multiple dimensions to pre-compute various sub-cubes. Multi-way array aggregation is fast because it operates on the MOLAP structure using direct array addressing.

Figure 4 shows multi-way array aggregation where ABCD is the base cuboid. Memory chunking is done to fit ABCD, and from ABCD the cuboids ABC, AB, A, etc. can be calculated, allowing multiple aggregations across various dimensions.

ABCD
ABC   BCD
AB   BC
A

Fig. 4. Multi-way

The multi-way algorithm is infeasible for a large number of dimensions, because the larger arrays may not fit into the chunks. The method also continues to compute unnecessary aggregates without pruning them beforehand.

2. BUC

The bottom-up computation (BUC) method addresses partial cube computation with iceberg conditions [14]. It uses an Apriori-based pruning method in which cells that do not satisfy a minimum threshold are pruned, so that they are not included in the pre-computation of other aggregates.

For the dimension A, if a cell of A does not satisfy the minimum threshold, then it cannot contribute to the aggregation of AB and ABC either. The BUC method uses the available memory efficiently by pruning unnecessary aggregations in advance.

3. Sorting

The sort-based methods rely on optimizations over sorted aggregates [15]. Usually a data warehouse model follows a fixed order of dimensions in its design. Irrespective of the dimension order in the query, the application has to sort the query order in accordance with the design order. The sort-based method optimizes the sorts by first sorting the required group-by and then pre-computing the prefix group-bys from the initial sort.

For instance, with an initial sort of ABCD, the prefix group-bys ABC, AB and A can be pre-computed without actually sorting them, thus avoiding additional sorts.

4. Hashing

The hash-based method relies on optimizations of cached results and scans [16]. Usual pre-computation methods incur multiple scans of the dimensional attributes, which is costly. For instance, in one scan the aggregate ABC is pre-computed. To compute AB, we have to scan AB once again, thus taking two scans of the same attributes.

Instead, the hash-based method caches results to reduce the number of scans. For example, the hash-based method maintains hash tables in memory, where AB and AC can fit. Now, in one scan of ABC, both AB and AC can be pre-computed.

5. H-Cubing

A better cube computation is offered by H-cubing, discussed in [17]. H-cubing operates on a tree-like data structure called the H-tree. From the lattice structure shown in Figure 5, H-cubing constructs an H-tree from which it computes the multidimensional aggregates. The advantage of H-cubing is that, while at one level, the method calculates the possible aggregates within that level before proceeding to the next higher level.

6. Star Cubing

Star cubing, discussed in [18], combines the features of multi-way, BUC and H-cubing. It combines the top-down and bottom-up computation approaches. From the lattice structure shown in Figure 5 it constructs a star tree, identifies the star nodes as the nodes that do not satisfy the iceberg conditions, and prunes them. The advantage of star cubing is that, while at one level, the method also computes the aggregates of the next lower and higher levels using shared dimensions, and simultaneously prunes the aggregates that do not satisfy the iceberg conditions.

B. OLAP Indexing

To support fast access to multidimensional aggregates, OLAP systems use indexing. Many existing indexing methods are used by both MOLAP and ROLAP. We examine each in the context of our survey.

1. Natural Indexing

Natural indexing, also called array-based indexing, is supported by MOLAP. The array structure of MOLAP itself forms the natural index. Natural indexing is the only indexing method that operates on the storage layer of the data cube [19]. Because the indexing is done on the storage layer, it is faster and the requested views are retrieved very quickly. The lattice structure of the cube in Figure 5 presents the arrays as layered sub-cubes at various levels. Each end point of a layer represents one possible group-by of the dimensions, which are usually strings. In natural indexing all these string nodes are stored in string arrays. Whenever a user queries for
an aggregate ABC, the aggregate is mapped directly onto the end points of the layers using natural indexing, and the corresponding sub-cube, which is highlighted in Figure 5, is retrieved. Thus MOLAP's natural indexing offers improved performance by indexing directly on the structure of MOLAP. However, many enterprises prefer other indexing techniques, as natural indexing is not suited to large data sets. The major drawbacks of data cube natural indexing are:
• When the number of dimensions is large, the cube becomes sparser [20]; that is, several cells that represent particular attribute combinations will not contain any aggregated data. As a result, the natural indexing search for which sub-cubes are pre-computed becomes time consuming.
• The natural indexing search directly uses the user's group-by dimensions, which are string data. Searching with string data also increases the storage space for large dimensions.

ALL                          (level 0)
A   B   C   D                (level 1)
AB   AC   AD   BC   BD   CD  (level 2)
ABC   ABD   ACD   BCD        (level 3)
ABCD                         (level 4)

Fig. 5. Lattice structure of cube with 4 dimensions

For instance, consider the SQL query

SQL query 4:
Select * from sales
Cube by A,B,C;

Here the MOLAP indexing directly indexes the outer view ABC, which is fetched from the string array using string comparison, and the view highlighted in Figure 5 is retrieved. This type of string comparison may require a large index file and a high comparison time, and even the retrieval time is slightly increased.

To reduce the index file size and the string comparison time, the MOLAP-based model IBM Cognos 8 uses a transformer module in which user queries are transformed to BEx queries (Business Explorer queries) [21]. In the BEx queries, the user-requested string group-bys are represented using BEx variables, which are also long strings. For example, in the above query the group-by ABC can be assigned a BEx variable of the type SAPBWOODPP2. Then, using the Cognos locate option, the BEx variables are matched.

In the context of our survey, we point out the drawback of BEx queries: BEx variables are also long strings, and with string comparisons the search index file may be large and the search time therefore long. We want to adopt a method in which, in the transformed query, the user-requested string group-bys are transformed to unique binaries, thereby reducing the search index file size and the search time.

2. Tree Based Indexing And Variations

Both MOLAP and ROLAP support tree-based indexing methods. One of the traditional tree-based indexes is the B-tree [22]. A B-tree index includes sub-trees corresponding to each dimension of the data cube. As the values of the cube dimensions are unique, the B-tree uses these dimensions as index pointers that point to the sub-trees. By tracing the pointers, data can be easily retrieved. For an 8-byte column the B-tree index file size is 326 MB and the construction time is 1580 s. Building a B-tree index on a large column is expensive in terms of space and construction time. The main drawback of B-tree indexes is that the tree must be rebalanced on updates.

Other popular tree-based indexing structures that are supported by both MOLAP and ROLAP technologies are R-trees [23] and aR-trees [24]. R-tree indexing supports complex range queries to some extent. Much research was done on R*-trees to extend them into structures like Ra*-trees [25] and Hilbert R-trees. All of these use more sophisticated update algorithms; they can answer complex range queries, and they can dynamically rebalance the tree structure whenever updates are performed. The major drawbacks of tree-based indexing are:
• Huge storage.
• Supports only few dimensions.

3. Bitmap Indexing And Variations

Bitmap indexing was introduced to enhance performance on various query types [26]. One bitmap index is associated with each attribute of the table. Each row of the bitmap vector is given a row-id starting from 0. Rows will have distinct attribute values. The basic idea behind bitmap indexes is to use a string of bits to indicate whether the indexed attribute in a table is equal to a specific value or not. If the bit is set to 1, it indicates that the row with the corresponding row-id contains the key value; otherwise the bit is set to 0. Complex queries on one or more dimensions can be answered by intersecting the bitmaps over multiple dimensions and also by using AND/OR operations. The major advantages of bitmaps are:
• Overcomes the storage limitation of B-trees.
• Bitmaps are retrieval efficient for low cardinalities.
• Sparse data can be efficiently handled.
• More CPU efficient because of their simple representation.
The major disadvantages of bitmaps are:
• Efficiency decreases for high cardinalities.
• AND/OR operations are expensive.
• As the number of dimensions increases, more bitmap vectors are needed, which results in storage overhead.
• Cannot support huge reads/updates.

To address this storage overhead, encoded bitmaps [27] and hybrid bitmap methods [28] were introduced. Encoded bitmap indexing can be used for large cardinalities. The basic idea of encoded bitmap indexing is to encode the attribute domain, thereby reducing the number of bit vectors and thus the storage space.

Other variations of bitmaps, such as projection bitmaps and bit-sliced indexes, are discussed in [29].

TABLE 2. OLAP TECHNOLOGIES

OLAP Technology | Features | Adopted by
MOLAP | High performance, less scalable for huge dimensions | Microsoft SQL Server 2005, Essbase server from Hyperion
ROLAP | Low performance, more scalable for huge dimensions | Microsoft SQL Server 2005, MicroStrategy's DSS Server, Informix's MetaCube
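The bitmap indexing idea from the indexing discussion above (one bit vector per attribute value, with row-ids starting at 0 and queries combined through AND/OR) can be sketched as follows. This is a minimal illustration and not the implementation of any cited system: the column data and the function names build_bitmap_index, row_ids and encoded_vector_count are hypothetical. The last function assumes the usual binary encoding of the attribute domain, under which about log2 of the cardinality bit vectors suffice.

```python
def build_bitmap_index(column):
    """Simple bitmap index: one bit vector (a Python int) per distinct value.

    Bit i of the vector for value v is 1 when the row with row-id i holds v.
    """
    index = {}
    for row_id, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << row_id)
    return index

def row_ids(bitmap):
    """Decode a bit vector back into the list of row-ids whose bit is 1."""
    ids, i = [], 0
    while bitmap:
        if bitmap & 1:
            ids.append(i)
        bitmap >>= 1
        i += 1
    return ids

def encoded_vector_count(cardinality):
    """Bit vectors needed when the attribute domain is binary-encoded (assumed scheme)."""
    return max(1, (cardinality - 1).bit_length())

# Hypothetical columns of a small sales table (row-ids 0..5).
region = ["east", "west", "east", "north", "west", "east"]
product = ["pen", "pen", "ink", "pen", "ink", "pen"]

region_idx = build_bitmap_index(region)
product_idx = build_bitmap_index(product)

# region = 'east' AND product = 'pen'
both = region_idx["east"] & product_idx["pen"]
# region = 'east' OR region = 'north'
either = region_idx["east"] | region_idx["north"]

print(row_ids(both))
print(row_ids(either))
# Simple index: one vector per distinct value; encoded index: far fewer vectors.
print(len(region_idx), encoded_vector_count(len(region_idx)))
```

With a simple bitmap index the number of vectors grows linearly with the cardinality of the attribute, which is the storage overhead that the encoded variant is meant to reduce.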
E. Query Processing Cost

Early OLAP systems were criticized for being inefficient in handling complex queries with huge operations. The efficiency of distributed OLAP is measured by how well the query is optimized. Query optimization is a way in which complex queries are transformed to use less costly operations. Many distributed OLAP systems use query optimization to reduce the processing cost by applying transformation mechanisms to the search space. A desirable optimization is one that incurs less cost by reducing the search space and minimizing the response time.

The distributed OLAP optimizations include many transformation techniques in which the original query is translated into algebraic expressions that represent the original query. Evaluating the user query using these algebraic expressions costs less than evaluating the original query. Many of the distributed query evaluation techniques succeed in minimizing the search time but fail to reduce the search space.

The SKALLA system discussed in [36] uses Multidimensional Join (MDJ) and Generalized MDJ (GMDJ) operators for expressing OLAP queries. The GMDJ operator optimizes complex OLAP queries by separating the aggregate functions, the definitions and the dimensions from the complex query into operator notations that cost less. While separating the query into GMDJ expressions, the SKALLA system still includes the string dimensions of the cube-by query. These GMDJ expressions with string dimensions may decrease the query response time, but the search space, which includes the string dimensions, may still be large.

For instance, let A, B be two tables, f1, f2, …, fn be the list of aggregate functions, and a1, a2, …, an and b1, b2, …, bn be the dimensional attributes of A and B. The GMDJ expression is also a relation of the type (f1Aa1, f2Aa2, …, fnAan, f1Bb1, …, fnBbn). Since the search is done using these GMDJ expressions, which include the dimensions and dimensional hierarchies Aa1, Aa2, …, Bb1, Bb2, …, that are strings, the search space increases.

F. Communication Cost

A decreased communication cost increases the efficiency of the OLAP technology.

For example, consider the SQL query:

SQL query 6:
Select A, B, C, D, sum(s)
Cube by A, B, C, D;

The query is a request for the group-by with 4 attributes A, B, C, D, which are strings. This group-by represents a materialized view of the whole MOLAP cube shown in Figure 5. In a distributed scenario, if this view is not present at a node, then the query has to be redirected to other nodes. Our survey found that, in a distributed scenario, OLAP redirects the query in a translated form that includes the group-by attributes, which are strings. Communicating string group-bys to more than one node may increase the communication cost.

The GMDJ relations discussed above are used to count the number of query redirects to various distributed nodes. While redirecting these GMDJ relations, the SKALLA system includes a reduced base table of the original query, thus reducing the communication cost. However, the reduced base relations still include the string dimensions, which may increase the communication cost. This leads us to the following problem:

Problem 2 description: Can there be a better translation of group-bys from string data to something like binary, so that communicating binaries instead of strings decreases the communication cost? We started our work by combining these two problems and plan to publish it as our future extension to distributed OLAP technologies. Using this translation of cube-by dimensions to binaries, we can address the problem of the cube-by operator and make MOLAP less costly, so that the MOLAP technology can be adopted by all.

VI. CONCLUSIONS AND FUTURE WORK

In this paper we discussed prominent OLAP technologies and their accessing methods. Though MOLAP and ROLAP differ in features, both are considerably extending their services to real-time decision making. Initially many enterprises adopt MOLAP; then, after becoming better acquainted with its usage, they switch to ROLAP. We provided a survey of various standard OLAP accessing methods and their disadvantages. In the context of our survey we highlighted MOLAP's advantage of fast search and retrieval. Because of MOLAP's dimensional cursity (curse of dimensionality) problem, many enterprises are moving to ROLAP. Even distributed OLAP suffers from increased communication cost. Research in data cube technology is still producing new indexing methods. Some of these methods are still under our study, and we will present them in our future work.

We are working on the problem of using MOLAP technology in a way that reduces sparsity and provides fast search, and also on communicating the group-bys in the distributed MOLAP cube architecture so as to reduce the communication cost.

As a future enhancement we want to map the MOLAP lattice structure to a compressed lattice structure whose nodes are binaries rather than strings. We want to adopt a query transformation mechanism that translates the string cube-by dimensions to binaries that exactly represent the lattice nodes. The search can then be carried out on the compressed lattice with binaries, thereby reducing the query retrieval time as well as the search space. In the distributed OLAP architecture, if the cube-by view is not present at a location, then instead of communicating the string cube-by dimensions our method communicates binaries that are effectively the same as the requested view, thereby reducing the communication cost.
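The translation of string group-bys to binaries proposed above is not specified in this paper. The following is therefore only a hypothetical sketch of one possible scheme, assumed purely for illustration: each dimension of the four-dimensional cube of Figure 5 is assigned one bit, so every lattice node becomes a small integer. All names in the code (BIT, encode_group_by, decode_group_by, lattice_level, can_derive) are ours.

```python
DIMENSIONS = ["A", "B", "C", "D"]   # cube dimensions, as in Figure 5
BIT = {name: 1 << i for i, name in enumerate(DIMENSIONS)}  # assumed: one bit per dimension

def encode_group_by(group_by):
    """Translate a string group-by such as ['A', 'B', 'C'] into one integer."""
    code = 0
    for name in group_by:
        code |= BIT[name]
    return code

def decode_group_by(code):
    """Recover the dimension names from an integer lattice node."""
    return [name for name in DIMENSIONS if code & BIT[name]]

def lattice_level(code):
    """Level of the node in the lattice = number of dimensions in the group-by."""
    return bin(code).count("1")

def can_derive(finer, coarser):
    """True when the dimensions of `coarser` are a subset of those of `finer`."""
    return coarser & finer == coarser

abc = encode_group_by(["A", "B", "C"])
ab = encode_group_by(["A", "B"])
abcd = encode_group_by(DIMENSIONS)

print(format(abc, "04b"), lattice_level(abc))
print(decode_group_by(abc))
print(can_derive(abcd, ab), can_derive(ab, abc))

# Bytes sent for the group-by ABC as a delimited string versus as a bit mask.
string_bytes = len(",".join(["A", "B", "C"]).encode("utf-8"))
binary_bytes = (len(DIMENSIONS) + 7) // 8
print(string_bytes, binary_bytes)
```

Under this assumed scheme, matching a requested view against materialized views becomes an integer comparison rather than a string comparison, and the payload redirected between nodes has a fixed size that depends only on the number of dimensions, not on the length of their names.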