Online Analytical Processing on Graph Data
All content following this page was uploaded by Alejandro Vaisman on 19 October 2020.
Abstract
Online Analytical Processing (OLAP) comprises tools and algo-
rithms that allow querying multidimensional databases. It is based on
the multidimensional model, where data can be seen as a cube such
that each cell contains one or more measures that can be aggregated
along dimensions. In a “Big Data” scenario, traditional data warehous-
ing and OLAP operations are clearly not sufficient to address current
data analysis requirements, for example, social network analysis. Fur-
thermore, OLAP operations and models can expand the possibilities of
graph analysis beyond the traditional graph-based computation. Nev-
ertheless, there is not much work on the problem of taking OLAP
analysis to the graph data model.
This paper proposes a formal multidimensional model for graph
analysis, that considers the basic graph data, and also background in-
formation in the form of dimension hierarchies. The graphs in this
model are node- and edge-labelled directed multi-hypergraphs, called
graphoids, which can be defined at several different levels of granularity
using the dimensions associated with them. Operations analogous to
the ones used in typical OLAP over cubes are defined over graphoids.
The paper presents a formal definition of the graphoid model for OLAP,
proves that the typical OLAP operations on cubes can be expressed
over the graphoid model, and shows that the classic data cube model is
a particular case of the graphoid data model. Finally, a case study sup-
ports the claim that, for many kinds of OLAP-like analysis on graphs,
the graphoid model works better than the typical relational OLAP
alternative, and for the classic OLAP queries, it remains competitive.
1 Introduction
Online Analytical Processing (OLAP) [9, 15] comprises tools and algorithms
that allow querying multidimensional (MD) databases. In these databases,
data are modelled as data cubes, where each cell contains one or more mea-
sures of interest, that quantify facts. Measure values can be aggregated
along dimensions, organized as sets of hierarchies. Traditional OLAP op-
erations are used to manipulate the data cube, for example: aggregation
and disaggregation of measure data along the dimensions; selection of a
portion of the cube; or projection of the data cube over a subset of its di-
mensions. The cube is computed after a process called ETL, an acronym for
Extract, Transform, and Load, which requires a complex and expensive load
of work to carry data from the sources to the MD database, typically a data
warehouse (DW). Although OLAP has been used for social network analy-
sis [10, 12], in a “Big Data” scenario, further requirements appear [5]. In the
classic paper by Cohen et al. [4], the so-called MAD skills (standing for
Magnetic, Agile and Deep) required for data analytics are described. In this
scenario, more complex analysis tools are required, that go beyond classic
OLAP [14]. Graphs, and, particularly, property graphs [8, 13], are becoming
increasingly popular to model different kinds of networks (for instance, social
networks, sensor networks, and the like). Property graphs underlie the
most popular graph databases [1]. Examples of graph databases and graph
processing frameworks following this model are Neo4j¹, JanusGraph² (previously
called Titan), and GraphFrames³. In addition to traditional graph
analytics, it is also interesting for the data scientist to have the possibility
of performing OLAP on graphs.
From the discussion above, it follows that, on the one hand, traditional
data warehousing and OLAP operations on cubes are clearly not sufficient to
address the current data analysis requirements; on the other hand, OLAP
operations and models can expand the possibilities of graph analysis be-
yond the traditional graph-based computation, like shortest-path, centrality
analysis and so on. Nevertheless, few proposals along these lines have been
presented so far. In addition, most of the existing work addresses
homogeneous graphs (that is, graphs where all nodes are of the same type),
where the measure of interest is related to the OLAP analysis on the graph
topology [3, 17, 18]. Further, existing works only address graphs with binary
relationships (see Section 2 for an in-depth discussion of these issues). However,
real-world graphs are complex and often heterogeneous: nodes and edges can be
of different types, and edges can relate different numbers of entities.
¹ http://www.neo4j.com
² http://janusgraph.org/
³ https://graphframes.github.io/
This paper proposes an MD data model for graph analysis that considers
not only the basic graph data, but background information in the form of
dimension hierarchies as well. The graphs in this model are node- and edge-
labelled directed multi-hypergraphs, called graphoids. In essence, these can
be regarded as “property hypergraphs”. A graphoid can be defined at several
different levels of granularity, using the dimensions associated with it.
For this, the Climb operation is available. Over this model, operations like
the ones used in typical OLAP on cubes are defined, namely Roll-Up, Drill-
Down, Slice, and Dice, as well as other operations for graphoid manipulation,
e.g., n-delete (which deletes nodes). The hypergraph model allows a natu-
ral representation of facts with different dimensions, since hyperedges can
connect a variable number of nodes of different types. A typical example is
the analysis of phone calls, the running example that will be used through-
out this paper. Here, not only point-to-point calls between two partners
can be represented, but also “group calls” between any number of partici-
pants. In classic OLAP [9], a group call must be represented by means of a
fact table containing a fixed number of columns (e.g., caller, callee, and the
corresponding measures). Therefore, when the OLAP analysis for telecom-
munication information concerns point-to-point calls between two partners,
the relational representation (denoted ROLAP) works fine, but when this
is not the case, modelling and querying issues appear, which calls for a
more natural representation, closer to the original data format. And here is
where the hypergraph model comes to the rescue [6]. In summary, the main
contributions of the paper are:
In addition to the above, of course all the classic analysis tools from
graph theory are supported by the model, although this topic is beyond the
scope of this paper.
Remark 1 This paper does not claim that the graphoid model is always
more appropriate than the classic relational OLAP representation. Instead,
the proposal aims at showing that when a more flexible model is needed,
where n-ary relationships between instances are present (and n is variable),
the model allows not only for a more natural representation, but can also
deliver better performance for some critical queries. □
2 Related Work
The model described in the next sections is based on the notion of property
graphs [2]. In this model, nodes and edges (hyperedges, as will be explained
later) are labelled with a sequence of attribute-value pairs. It will be as-
sumed that the values of the attributes represent members of dimension
levels (i.e., each attribute value is an element in the domain of a dimen-
sion level), and thus nodes and edges can be aggregated, provided that an
attribute hierarchy is defined over those dimensions. Property graphs are
the usual choice in modern graph database models used in practical implementations.
Attributes are included in nodes and edges mainly to improve
the speed of retrieval of the data directly related to a given node.
Here, these attributes are also used to perform OLAP operations.
A key difference between existing works, and the proposal introduced
in this paper, is that the latter supports the notion of OLAP hypergraphs,
highly expanding the possibilities of analysis. This way, instead of binary
relationships between nodes, there are n-ary, possibly duplicated relationships,
which are typical in Data Warehousing and OLAP. Further, supporting
n-ary relationships allows naturally modelling OLAP situations where
different facts have a different number of relations, like in the group calls
case discussed in Section 1, and studied in Section 6. In other words, the
model handles multi-hypergraphs. Also, the paper works over the classic
OLAP operations, and formally defines their meaning in a graph context.
This approach allows an OLAP user to work with the notion of a data cube
at the conceptual level [15], regardless of the kind of underlying data (in this
case, graphs), defining OLAP operations in terms of cubes and dimensions
rather than in terms of nodes and edges. Finally, the authors have shown the
usefulness of this proposal in different scenarios, like trajectory analysis [7]
and typical OLAP analysis on social networks [16].
3 Data Model
This section presents the graphoid OLAP data model. First, background di-
mensions are formally defined, along the lines of the classic OLAP literature.
Then, the (hyper)graph data model is introduced.
The running example used throughout this paper analyses calls between
customers, who belong to different companies. For this, as background
(contextual) information for the graph data representing calls (to be ex-
plained later), there is a Phone dimension, with levels Phone (representing
the phone number), Customer, City, Country, and Operator. There is also a
Time dimension, with levels Date, Month, and Year. The following examples
explain this in detail.
Example 1 Figure 1 depicts the dimension schemas σ(Phone) and σ(Time),
for the dimensions Phone and Time, respectively. In addition, there is also a
dimension denoted Id, representing identifiers, that will be explained later.
In the dimension Phone, it holds that Bottom = Phone, and there are two
hierarchies denoted, respectively, as
Phone → Customer → City → Country → All
and
Phone → Operator → All.
The node Customer is an example of a level in the first of the above hierarchies.
For the dimension Time, Bottom = Day holds, as well as the
hierarchy Day → Month → Year → All. □
Figure 1: Dimension schemas for the dimensions Time (a), Phone (b), and Id (identifier) (c).
where the union is taken over all levels in σ(D). The edge set of this directed
acyclic graph is defined as follows. Let ℓ and ℓ′ be two levels of σ(D), and
let a ∈ dom(D.ℓ) and a′ ∈ dom(D.ℓ′). Then, only if there is a directed edge
from ℓ to ℓ′ in σ(D), there can be a directed edge in I(σ(D)) from a to a′.
If H is a hierarchy in σ(D), then the hierarchy instance (relative to the
dimension instance I(σ(D))) is the subgraph of I(σ(D)) with nodes from
dom(D.ℓ), for ℓ appearing in H. This subgraph is denoted I_H(σ(D)). □
[Figure: a dimension instance with the members all, US, Italy, NYC, and Rome.]
3.2 The Base Graph and Graphoids
As a basic data structure for modelling OLAP on graph data, the concept
of graphoid is introduced and defined in this section. A graphoid plays
the role of a multi-dimensional cuboid in classical OLAP and it is designed
to represent the information of the application domain, at a certain level
of granularity. Essentially, a graphoid is a node- and edge-labelled directed
multi-hypergraph.
In what follows, a collection of dimensions D1, ..., Dd is assumed in the
application domain, and their schemas σ(D1), ..., σ(Dd) are given. Furthermore,
hierarchy instances I(σ(D1)), ..., I(σ(Dd)) for all dimensions are given.
Finally, assume that a special dimension D0 = Id is given, to represent
unique identifiers (Figure 1(c)). The notions of attributes, node types and
edge types are defined next.
This means that dim(#e) is an element of {D0, D1, ..., Dd}^ar(#e). The tuple
dim(#e) expresses which attributes are associated with an edge of type #e,
without specifying their levels. Finally, assume that dim(#e) contains no
repetition. The identifier dimension (at its Bottom level) may appear, but
is not required. If the identifier dimension appears, this only occurs once,
among the attributes that describe edges of a certain type.
It is now possible to define the notion of graphoid.
The basic graph data that serves as input data to the graph OLAP
process, is called the base graph. A base graph plays the role of a multi-
dimensional cube in classical OLAP and is designed to contain all the infor-
mation of the application domain, at the lowest level of granularity.
⁴ Let A and B be bags (or sets). If the number of occurrences of each element a in A
is less than or equal to the number of occurrences of a in B, then A is called a subbag of
B, also denoted A ⊆ B.
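The subbag relation can be illustrated with a short Python sketch based on `collections.Counter`, which stores the number of occurrences of each element. The function name `is_subbag` and the string encoding of the calls are assumptions made for this example.

```python
from collections import Counter


def is_subbag(a, b):
    """True if every element occurs in `a` at most as often as in `b`."""
    count_a, count_b = Counter(a), Counter(b)
    # Counter returns 0 for elements that are absent from b.
    return all(count_b[x] >= n for x, n in count_a.items())


calls = ["1->2", "1->2", "4->3"]
print(is_subbag(["1->2", "1->2"], calls))          # two occurrences are available
print(is_subbag(["1->2", "1->2", "1->2"], calls))  # three occurrences are not
```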
Definition 4 (Base graph) Let dimensions D1, ..., Dd be given with their
respective schemas and instances. The (D1.Bottom, ..., Dd.Bottom)-graphoid
is called the base graph. □
Hyperedges represent phone calls, which most of the time involve two
phones, but which may also involve multiple phones, representing so-called
“group calls.” So, edges are all of the same type #Call and E = {#Call}.
In Figure 3, a directed hyperedge from a subset S of N to a subset T of
N is graphically represented by a coloured node which has incoming arrows
(of the same colour) from all elements of S and outgoing arrows (again of
the same colour) to all elements of T . Such a coloured construction is a
depiction of the hyperedge e = (S, T), which will be denoted S → T from
now on.⁵ For example, the red and purple hyperedges {1} → {2} represent
two different phone calls from Ph1 to Ph2, made on the same day and of
the same duration. This example explains why the model assumes bags
rather than sets. The orange hyperedge {3} → {2, 5} represents a group
call, from Ph3 to both Ph2 and Ph5 . There are six phone calls shown in
the figure. So, E is the bag {{{1} → {2}, {1} → {2}, {4} → {3}, {4} →
{5}, {3} → {2, 5}, {5} → {2, 3}}}. The edge labelling function λE associates
two attributes with edges of type #Call, namely Date and Duration. Date
is a dimensional attribute to which the dimensional hierarchy in Figure 1
is associated. Duration is a measure attribute (whose associated
aggregation function is, in this case, summation).
⁵ The nodes of S are called the source nodes of e and the nodes of T are called the
target nodes of e. The source and target nodes of e are called adjacent to e, and the set
of the adjacent nodes to e is denoted by Adj(e). Thus, Adj(e) = S ∪ T.
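To make the bag semantics concrete, the six hyperedges listed above can be stored in a multiset. The following Python sketch is one possible encoding, not a prescribed data structure: the helper names `edge` and `adj` are hypothetical, and edge labels (Date, Duration) are left out to keep the example small.

```python
from collections import Counter


def edge(sources, targets):
    """A directed hyperedge S -> T, stored as a hashable pair of frozensets."""
    return (frozenset(sources), frozenset(targets))


def adj(e):
    """Adj(e) = S ∪ T, the nodes adjacent to hyperedge e."""
    sources, targets = e
    return sources | targets


# The bag E of the running example; Counter keeps the multiplicity of each edge.
E = Counter([
    edge({1}, {2}), edge({1}, {2}),   # two distinct calls from Ph1 to Ph2
    edge({4}, {3}), edge({4}, {5}),
    edge({3}, {2, 5}),                # group call from Ph3 to Ph2 and Ph5
    edge({5}, {2, 3}),
])

print(sum(E.values()))                 # number of calls, counting duplicates
print(E[edge({1}, {2})])               # multiplicity of {1} -> {2}
print(sorted(adj(edge({3}, {2, 5}))))  # nodes adjacent to the group call
```

With a plain set instead of a `Counter`, the two calls from Ph1 to Ph2 would collapse into one, which is exactly the loss of information the bag-based definition avoids.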
[Figure 3: the base graph of the running example. Nodes 1 to 5 are labelled [#Phone, 11, Ph1], [#Phone, 12, Ph2], [#Phone, 13, Ph3], [#Phone, 14, Ph4], and [#Phone, 15, Ph5]. The six hyperedges are labelled [#Call, 10/10/2016, 4] (twice), [#Call, 11/10/2016, 6], [#Call, 5/5/2016, 8], [#Call, 10/10/2016, 3], and [#Call, 2/5/2016, 5].] □
Note that, although the base graph plays the role of a multi-dimensional
cube in classical OLAP (or a fact table in relational OLAP), a key difference
is that this cube has a variable number of “axes”, since it can represent facts
including a variable number of dimensions. The next example discusses two
graphoids whose dimensions are at different levels of granularity. Later it
will be explained how these graphoids can be obtained from the base one.
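[Figure: the graph of the running example with the Phone dimension at the Operator level. Nodes 1 to 5 are labelled [#Phone, 11, ATT], [#Phone, 12, Vodafone], [#Phone, 13, Movistar], [#Phone, 14, Vodafone], and [#Phone, 15, Movistar]. The six hyperedges are labelled [#Call, 10/10/2016, 4] (twice), [#Call, 11/10/2016, 6], [#Call, 5/5/2016, 8], [#Call, 10/10/2016, 3], and [#Call, 2/5/2016, 5].]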
all information in Figure 5 is at the level of Day and all information for the
dimension Phone is at the level of Operator. These examples show that
there can be more than one (Time.Day, Phone.Operator)-graphoid “consistent”
with the given base graph. Thus, some kind of normalization is
needed. This is studied in the next section. □
[Figure: a graphoid with nodes 1, 2, and 3, node labels [#Phone, 11, ATT] and [#Phone, 13, Movistar], and hyperedge labels [#Call, 5/5/2016, 8], [#Call, 2/5/2016, 5], [#Call, 10/10/2016, 4], and [#Call, 11/10/2016, 6].]
attribute. On the right-hand side, this attribute is brought to the All level
in its dimension and gets the value all. The expected billing information is
moved to a new edge of type #HasExpectedBill, where it can be subject to
aggregation. The above operation is called the edgification of an attribute
A in a node of type #n, and it is denoted by Edgify(#n, A). □
Figure 6: (a) A node with label [#Phone, 11, Ph1, 880], where 880 expresses
the expected bill. (b) An edgification of this node, where the expected billing
information is moved to an edge that is labelled #HasExpectedBill.
3.3 Minimal graphoids
In this section, the notion of minimal (D1.ℓ1, ..., Dd.ℓd)-graphoid is defined.
This graphoid is obtained by collapsing the nodes that have identical
labels (apart from the identifier) in the original graphoid. Let G =
(N, τN, λN, E, τE, λE) be a (D1.ℓ1, ..., Dd.ℓd)-graphoid. If the nodes n1, n2 ∈
N have identical labels, apart from the identifier, denoted λN(n1) =_Id λN(n2),
then these nodes are identified, such that only the one with the smallest
identifier is preserved, while the others are deleted. So, if the λN-values of
the nodes n1, n2, ..., nk pairwise satisfy the =_Id-relationship, and n1 has the
smallest identifier among them, then the nodes n2, ..., nk are replaced by n1
and then deleted. The expression repN(ni) = n1, for i = 1, 2, ..., k, indicates
that n1 represents the nodes n1, n2, ..., nk in the minimal graph. All edges
leaving from or arriving at the nodes n2, ..., nk are redirected to n1. For
this purpose, the function repN is defined on subsets of the node set N: if
S ⊆ N, then repN(S) = {repN(n) | n ∈ S}. Now, the notion of minimal
graphoid is defined more formally.
are mapped to n by the repN-function. For edges, E′ is defined as the bag
{{repN(e) | e ∈ E}}, which means that for each hyperedge in E, there is a
corresponding hyperedge in E′. This means that the cardinalities of the bags
E and E′ are the same. □
For any (D1.ℓ1, ..., Dd.ℓd)-graphoid G = (N, τN, λN, E, τE, λE), the result
of the minimisation described in this section is denoted Minimize(G),
and called the minimisation of G.
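The minimisation step can be sketched in Python as follows. This is an illustrative simplification under stated assumptions: node keys play the role of identifiers, each label is a tuple that already excludes the identifier, hyperedges are unlabelled (S, T) pairs, and the function name `minimize` is hypothetical. Nodes are grouped by hashing their labels, which is one way to avoid pairwise comparisons; it is not claimed to be the procedure the complexity bound in this section refers to.

```python
from collections import defaultdict


def minimize(nodes, edges):
    """Collapse nodes with equal labels, keeping the smallest identifier.

    nodes: {identifier: label}, where the label excludes the identifier.
    edges: list (bag) of (sources, targets) pairs of identifier sets.
    """
    groups = defaultdict(list)
    for ident, label in nodes.items():
        groups[label].append(ident)
    # rep maps every node to the representative of its group (rep_N).
    rep = {n: min(members) for members in groups.values() for n in members}
    new_nodes = {r: nodes[r] for r in set(rep.values())}
    # Redirect every hyperedge; the list keeps duplicates, so |E'| = |E|.
    new_edges = [
        (frozenset(rep[n] for n in s), frozenset(rep[n] for n in t))
        for s, t in edges
    ]
    return new_nodes, new_edges, rep


# Phone nodes of the running example at the Operator level.
nodes = {
    1: ("#Phone", "ATT"), 2: ("#Phone", "Vodafone"), 3: ("#Phone", "Movistar"),
    4: ("#Phone", "Vodafone"), 5: ("#Phone", "Movistar"),
}
edges = [({1}, {2}), ({1}, {2}), ({4}, {3}), ({4}, {5}), ({3}, {2, 5}), ({5}, {2, 3})]

new_nodes, new_edges, rep = minimize(nodes, edges)
print(rep)             # nodes 4 and 5 are represented by nodes 2 and 3
print(len(new_edges))  # same cardinality as the original bag of edges
```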
Remark 5 It is easy to see that the minimal (D1.ℓ1, ..., Dd.ℓd)-graphoid of
a (D1.ℓ1, ..., Dd.ℓd)-graphoid G = (N, τN, λN, E, τE, λE) can be computed,
in the worst case, in time that is quadratic in |N| and linear in |E|. This can
be improved, for instance, with an early pruning of the nodes that will not
be contracted. Addressing this issue is beyond the scope of this paper. □
4.1 Climb
The Climb-operation, intuitively, allows one to define graphs at different levels
of granularity, based on the background dimensions.
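A minimal Python sketch of this intuition follows. It assumes node labels of the form (type, identifier, value) with a single dimensional attribute per node, and it takes the Phone-to-Operator mapping from the labels shown in the example figures; the function name `climb` and its signature are illustrative assumptions rather than the formal definition.

```python
# Rollup from level Phone to level Operator (taken from the example figures).
TO_OPERATOR = {
    "Ph1": "ATT", "Ph2": "Vodafone", "Ph3": "Movistar",
    "Ph4": "Vodafone", "Ph5": "Movistar",
}


def climb(nodes, node_type, rollup):
    """Replace the level value in every node of `node_type` by its rollup image.

    Nodes of other types are returned unchanged; edges are not touched.
    """
    return {
        key: (ntype, ident, rollup[value]) if ntype == node_type
        else (ntype, ident, value)
        for key, (ntype, ident, value) in nodes.items()
    }


base_nodes = {
    1: ("#Phone", 11, "Ph1"), 2: ("#Phone", 12, "Ph2"), 3: ("#Phone", 13, "Ph3"),
    4: ("#Phone", 14, "Ph4"), 5: ("#Phone", 15, "Ph5"),
}

climbed = climb(base_nodes, "#Phone", TO_OPERATOR)
print(climbed[4])  # the label of node 4 now carries the operator
```

Note that climbing alone does not merge nodes: nodes 2 and 4 both end up with the value Vodafone but remain distinct until a minimisation is applied.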
4.2 Grouping
The Group-operation, both on nodes and on edges, is defined in this section.
The edge-grouping of G along the dimension Dk from level ℓk to level ℓ′k
in all hyperedges of type #e, denoted Group(G, #e, Dk.(ℓk → ℓ′k)), is defined
as Climb(G, #e, Dk.(ℓk → ℓ′k)). □
4.3 Aggregate
In this section, the Aggr-operation on measures stored in edges is defined.
Remark 8 Although the operations Climb, Group, and Aggr are not present
in classic relational OLAP, they are included here for two reasons: first,
they can be useful when operating on graphs in practice; second, they
simplify the definition of the Roll-Up operation, which
otherwise could be unnecessarily difficult to express. □
4.4 Roll-Up
The operations defined above allow defining the Roll-Up-operation over di-
mensions and measures stored in edges, as explained next.
Definition 9 (Roll-Up) Assume a (D1.ℓ1, ..., Dd.ℓd)-graphoid G is given
as follows: G = (N, τN, λN, E, τE, λE). Let Dc be a dimension that appears
in some nodes and/or hyperedges of G, that plays the role of a climbing
dimension. Let M1, ..., Mk be dimensions that appear in the hyperedges
of type #e of G. These dimensions play the role of measure dimensions,
and it is assumed that aggregate functions F1, ..., Fk are associated with
them. Let #n1, ..., #nr be node types appearing in G, and let #e1, ..., #es
be hyperedge types appearing in G. The roll-up of G over the dimensions
M1, ..., Mk (using the functions F1, ..., Fk) in hyperedges of type #e, and
over the climbing dimension Dc from level ℓc to level ℓ′c in nodes of types
#n1, ..., #nr and edges of types #e1, ..., #es, denoted
Roll-Up(G, {#n1, ..., #nr, #e1, ..., #es}, Dc.(ℓc → ℓ′c); #e, M1, ..., Mk, F1, ..., Fk),
is defined as
Remark 9 To apply the climbing in the roll-up operation to the nodes and
edges of all possible types, the shorthand “∗” is used as follows: Roll-Up(G,
∗, Dc.(ℓc → ℓ′c); #e, M1, ..., Mk, F1, ..., Fk). To aggregate over all edge types,
the notation is Roll-Up(G, ∗, Dc.(ℓc → ℓ′c); ∗, M1, ..., Mk, F1, ..., Fk). □
4.5 Drill-Down
The Drill-Down-operation does the opposite of Roll-Up,⁶ taking a graphoid
to a finer granularity level, along a dimension Dd, call it a descending
dimension, and also operating over a collection of measures, using the same
aggregate functions associated with such measures. Note also that descending
from a level ℓd down to a level ℓ′d along a dimension Dd is equivalent
to climbing from the bottom level of Dd, Dd.Bottom, to the level ℓ′d along
Dd. Thus, the drill-down of G over the dimensions M1, ..., Mk (using the
functions F1, ..., Fk) in hyperedges of type #e, and over the descending
dimension Dd from level ℓd to level ℓ′d in nodes of types #n1, ..., #nr and edges
of types #e1, ..., #es, denoted
is defined as
Given the above, in what follows the discussion is limited to the Roll-Up-
operation.
⁶ Actually, this is true for a sequence of roll-up and drill-down operations such that
there are no slicing or dicing operations (explained in Sections 4.6 and 4.7) in-between.
However, for the sake of simplicity, and without loss of generality, in this paper it is
assumed that roll-up and drill-down are the inverse of each other.
[Figure: hyperedge labels [#Call, 2016, 13], [#Call, 2016, 8], and [#Call, 2016, 9].]
4.6 Dice
The Dice-operation over a graphoid produces a subgraphoid that satisfies a
Boolean condition ϕ over the available dimension levels. A “strong” version
is also defined, called the s-Dice-operation. In this context, ϕ is a Boolean
combination of atomic conditions of the form D.ℓ < c, D.ℓ = c, and D.ℓ > c,
where D is a dimension, ℓ is a level in that dimension, and c ∈ dom(D.ℓ).
The expression ϕ can be written in disjunctive normal form as
$$\phi = \bigvee_{k} \bigwedge_{l} \phi_{kl},$$
where each ϕkl is an atomic condition of the above form.
4.7 Slice
Intuitively, the Slice operation eliminates the references to a dimension in a
graphoid. The formal definition follows.
is defined as the roll-up operation up to the level Ds.All over the dimensions
M1, ..., Mk (using the functions F1, ..., Fk). Formally, this slice operation is
defined as Roll-Up(G, ∗, Ds.(ℓs → All); ∗, M1, ..., Mk, F1, ..., Fk). □
4.8 Node-delete
The n-Delete-operation over a graphoid deletes all nodes of a certain type
and also removes the nodes of this type from the source and target sets of all edges.
Again, although this operation is not present in classic OLAP, it is needed to
simulate the classic OLAP slice operation, as will become clear in Section 5.2.
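The effect of n-Delete can be sketched in a few lines of Python. The encoding below is an assumption made for illustration: labels are tuples whose first component is the type, a hyperedge is a (sources, targets, label) triple, and the single sales fact is modelled with all three dimension nodes in its source set, which is not necessarily how the star-graphoid arranges sources and targets.

```python
def n_delete(nodes, edges, node_type):
    """Delete all nodes of `node_type` and drop them from every hyperedge."""
    doomed = {key for key, label in nodes.items() if label[0] == node_type}
    kept_nodes = {key: label for key, label in nodes.items() if key not in doomed}
    kept_edges = [
        (frozenset(s) - doomed, frozenset(t) - doomed, label)
        for s, t, label in edges
    ]
    return kept_nodes, kept_edges


# One sales fact; the Time node is assumed to be already at the All level.
nodes = {
    1: ("#Location", "Antwerp"),
    2: ("#Time", "all"),
    3: ("#Product", "Lego"),
}
edges = [({1, 2, 3}, set(), ("#Sales", 10))]

sliced_nodes, sliced_edges = n_delete(nodes, edges, "#Time")
print(sorted(sliced_nodes))        # the Time node is gone
print(sorted(sliced_edges[0][0]))  # and no longer adjacent to the fact
```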
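[Figure: a data cube with the dimensions Location (Marseille, Paris, Brussels, Antwerp), Time (1/1/2014, 2/1/2014, 31/1/2014), and Product (Brio, Lego, Oranges, Apples).]
[Figure, panels (a) and (b): graphoid representations of a cube cell, with the labels [#Location, 11, Antwerp], [#Time, 12, 1/1/2014], [#Product, 13, Lego], [#Sales, 10], [#Cube, 11], and [#InCube, Lego, Antwerp, 1/1/2014, 10].]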
Proof 1 Let C be a data cube, and let Star(C) be its star-graphoid. The
proof is based on showing that each of the classical OLAP operations Roll-Up,
Drill-Down, Slice and Dice, over C, can be equivalently applied on Star(C).
The semantics for the classical OLAP operations is the one given in [11].
Roll-Up. For cube data, a roll-up operation takes as input a data cube
C, a dimension Dc and a level ℓc in σ(Dc) and returns the aggregation
of the original cube along Dc up to level ℓc for all of the input measures
µ1, ..., µm, using aggregate functions F1, ..., Fm. Assume, without loss of
generality, that the roll-up starts at the Bottom level, that is, at dom(Dc).
Also assume, for the sake of clarity of exposition, that m = 1, that is,
that there is only one measure, call it µ, with associated aggregate function
F. Now, it will be shown that the roll-up Roll-Up(C, Dc.ℓc; µ, F) on
the cube C can be simulated on Star(C) by the graphoid OLAP-operation
Roll-Up(Star(C), {#Dc}, Dc.(Bottom → ℓc); #eµ, µ, F), where #Dc is the
unique node type in Star(C) that contains information on Dc and where
#eµ is the unique edge type that contains measure information on µ.
Let (a1, ..., ac−1, ac+1, ..., ad) be an element of dom(D1) × ··· × dom(Dc−1) ×
dom(Dc+1) × ··· × dom(Dd) and suppose that there are r values ac,i from
dom(Dc) (for i = 1, ..., r) such that (a1, ..., ac−1, ac,i, ac+1, ..., ad; mi) appear
in the cube C, and such that all ac,i roll up to the same element, call it aru,
that is, ρ_{Dc.Bottom→ℓc}(ac,i) = aru for i = 1, ..., r. The roll-up on C will replace these r
cells by one “new” cell which has coordinates (a1, ..., ac−1, aru, ac+1, ..., ad) in
dom(D1) × ··· × dom(Dc−1) × dom(Dc.ℓc) × dom(Dc+1) × ··· × dom(Dd), and
which contains the aggregated measure F({m1, ..., mr}). In Star(C), each
one of these “new” cells will be represented by a hyperedge. To achieve this,
the following graphoid OLAP-operation is performed:
To see the correctness of this claim, the substeps in the above graphoid
roll-up are analysed. First, Climb(Star(C), #Dc, Dc.(Bottom → ℓc)) is
performed; a graphoid called G1 is obtained. Compared against Star(C), in
G1 all nodes and edges remain the same, except for the nodes of type #Dc,
which now contain values at level ℓc. Next, a minimisation is performed
(to obtain a grouping on Dc), which may contract some nodes in G1 into
“roll-up” nodes. Call the resulting graphoid G2. These roll-up nodes of G2
simulate the “new” cells in the cube that store the aggregate information.
Finally, Aggr(G2, #eµ, µ, F) contracts edges that have the same adjacency
set and gives them the aggregated value of µ as attribute value.
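The three substeps (climb, minimise, aggregate) can be traced on the phone-call example with the following Python sketch. Several simplifications are assumed and should be read as such: node labels are (type, value) pairs, the Date attribute of the calls is ignored, edges with the same source set and the same target set are treated as having the same adjacency, and the assignment of the durations 8, 5, 6, and 3 to the four non-duplicated calls is a guess, chosen so that the totals agree with the values 13, 8, and 9 that appear in the roll-up figure labels. The function `roll_up` is a hypothetical helper, not the formal operator.

```python
from collections import defaultdict

TO_OPERATOR = {"Ph1": "ATT", "Ph2": "Vodafone", "Ph3": "Movistar",
               "Ph4": "Vodafone", "Ph5": "Movistar"}


def roll_up(nodes, edges, node_type, rollup, aggregate=sum):
    # Step 1 (climb): move the values of `node_type` nodes to the coarser level.
    climbed = {n: (t, rollup[v]) if t == node_type else (t, v)
               for n, (t, v) in nodes.items()}
    # Step 2 (minimise): nodes with equal labels collapse onto the smallest id.
    smallest = {}
    for n in sorted(climbed):
        smallest.setdefault(climbed[n], n)
    rep = {n: smallest[climbed[n]] for n in climbed}
    kept = {n: climbed[n] for n in set(rep.values())}
    # Step 3 (aggregate): contract edges with equal adjacency, combine measures.
    groups = defaultdict(list)
    for sources, targets, measure in edges:
        key = (frozenset(rep[x] for x in sources),
               frozenset(rep[x] for x in targets))
        groups[key].append(measure)
    return kept, {key: aggregate(values) for key, values in groups.items()}


nodes = {i: ("#Phone", f"Ph{i}") for i in range(1, 6)}
# (sources, targets, duration); durations of the last four calls are assumed.
edges = [({1}, {2}, 4), ({1}, {2}, 4), ({4}, {3}, 8),
         ({4}, {5}, 5), ({3}, {2, 5}, 6), ({5}, {2, 3}, 3)]

kept, totals = roll_up(nodes, edges, "#Phone", TO_OPERATOR)
for (s, t), total in totals.items():
    print(sorted(s), "->", sorted(t), total)
```

Passing a different `aggregate` function (for instance `max`) changes only the last step, which mirrors the role of the aggregate function F in the operation.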
Drill-Down. As mentioned above, the drill-down to level ℓ can be seen as
a roll-up from the Bottom level to level ℓ. Therefore, no proof is needed.
Slice. On data cubes, the Slice-operation takes as input a cube C and a dimension
Ds, and returns a cube in which the dimension Ds is dropped, and all
measures are aggregated over the dropped dimension. To drop the dimension
Ds, a roll-up to the level All in this dimension is needed first, such that
its domain becomes a singleton. Thus, to simulate this on Star(C) using
graphoid OLAP-operations, a climb to the level All in the dimension Ds is
performed, and therefore the proof of the roll-up case holds, taking into account
that all nodes representing Ds will contain the value “all”. Thus, the
slice of the cube C is simulated by Slice(Star(C), Ds; µ, F). There is one step
missing, however. When slicing a dimension from a cube C, this dimension
is deleted. In the case of the graphoid Star(C), the nodes of type #Ds
are still present in G1 = Slice(Star(C), Ds; µ, F). So, n-Delete(G1, #Ds) is
needed to delete these nodes.
Dice. Intuitively, the Dice(C, ϕ) operation, where ϕ is a Boolean condition
over level values and measures, selects the cells in a cube C that satisfy
ϕ. The resulting cube has the same dimensionality as the original cube. It
must be shown that Dice(C, ϕ) can be simulated by s-Dice(Star(C), ϕ). As in
Section 4.6, take
$$\phi = \bigvee_{k} \bigwedge_{l} \phi_{kl},$$
with ϕkl of the form D.ℓ < c, D.ℓ = c or D.ℓ > c, where D is a dimension,
ℓ is a level in that dimension and c ∈ dom(D.ℓ); or µ < c, µ = c or µ > c,
where µ is a measure and c belongs to the domain of that measure.
Let (a1, ..., ad; c1, ..., cm) ∈ dom(D1) × ··· × dom(Dd) → dom(µ1) × ··· ×
dom(µm) be a cell of C that satisfies ϕ. Denote this by (a1, ..., ad; c1, ..., cm) ⊨
ϕ. The proof here requires showing that the edges ej, labelled [#µj, cj] (that
are adjacent to the nodes [#Di, id, ai], for i = 1, ..., d), for j = 1, ..., m, also
satisfy ϕ. From (a1, ..., ad; c1, ..., cm) ⊨ ϕ it follows that there exists a k such
that for all l, (a1, ..., ad; c1, ..., cm) ⊨ ϕkl holds.
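On the cube side, the satisfaction of a condition in disjunctive normal form ("there exists a k such that for all l") translates directly into `any` over `all`. The Python sketch below covers only the classic Dice on cube cells, not the s-Dice on graphoids; the cell encoding, the names `satisfies` and `dice`, and all cells other than (Lego, Antwerp, 1/1/2014; 10) are assumptions made for the example.

```python
import operator

# The three atomic comparison forms used in the conditions.
OPS = {"<": operator.lt, "=": operator.eq, ">": operator.gt}


def satisfies(cell, dnf):
    """cell |= phi, where phi is a list of conjunctions of (attr, op, const)."""
    return any(
        all(OPS[op](cell[attr], const) for attr, op, const in conjunction)
        for conjunction in dnf
    )


def dice(cube, dnf):
    """Keep exactly the cells that satisfy the condition."""
    return [cell for cell in cube if satisfies(cell, dnf)]


cube = [
    {"Product": "Lego", "Location": "Antwerp", "Date": "1/1/2014", "Sales": 10},
    # The two cells below are hypothetical.
    {"Product": "Brio", "Location": "Paris", "Date": "2/1/2014", "Sales": 3},
    {"Product": "Oranges", "Location": "Brussels", "Date": "1/1/2014", "Sales": 7},
]

# (Location = Antwerp AND Sales > 5) OR (Product = Oranges)
phi = [
    [("Location", "=", "Antwerp"), ("Sales", ">", 5)],
    [("Product", "=", "Oranges")],
]

for cell in dice(cube, phi):
    print(cell["Product"], cell["Location"], cell["Sales"])
```

Dates are kept as plain strings here, so order comparisons on the Date attribute would not be meaningful in this sketch without parsing them first.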
If ϕkl is of the form D.ℓ < c, D.ℓ = c or D.ℓ > c, then ϕkl is undefined
in the edge label and thus, it is not false in it. Furthermore, because of the
particular definition of stars in star-graphoids, where all nodes that are adjacent
to an edge ej carry information on unique dimensions, ϕkl is not false
in all adjacent nodes that do not contain information on D.ℓ and it is true
in the unique adjacent node that contains information on D.ℓ. Therefore,
the edge ej satisfies ϕkl.
If ϕkl is of the form µ < c, µ = c or µ > c, then ϕkl evaluates to true
on one of the edges ej (that contains information on that measure µ) and is
undefined on the other edges (that contain information on other measures).
On the adjacent nodes to these edges, the condition ϕkl is not false (since
these nodes do not contain information on any measures). In both cases, all
these edges satisfy ϕkl. This means that the strong dice-operation will keep
all these edges.
By a similar reasoning, it can be shown that when (a1, ..., ad; c1, ..., cm) ⊭
ϕkl, ej ⊭ ϕkl holds.
This shows that exactly the edges (labelled [#µj, cj]) corresponding to
cells (a1, ..., ad; c1, ..., cm) where ϕ is not satisfied are deleted from the graphoid
Star(C) by the strong dice-operation. This completes the proof.
are written in Cypher, Neo4j’s high level query language.⁷
Although the schemas are the same in both cases, the instances differ
from each other. In one case, a call between phones Ph1, Ph2, and Ph3,
initiated by Ph1, contains the tuples (1, Ph1, Ph2) and (1, Ph1, Ph3). In
the other case, a tuple (1, Ph1, Ph1) is added to the other two to indicate
that Ph1 started the call. This makes a difference for queries where the user
is not interested in who initiated the call. In what follows, both relational
representations are denoted Calls and Calls-alt, respectively.
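The difference between the two instances can be made concrete with a small Python sketch that generates the tuples of one group call under each representation. The tuple layout (call identifier, caller, callee) follows the example tuples above; the function names are hypothetical, and the fact tables contain further attributes that are omitted here.

```python
def calls_rows(call_id, initiator, receivers):
    """Calls: one tuple per receiver of the call."""
    return [(call_id, initiator, r) for r in receivers]


def calls_alt_rows(call_id, initiator, receivers):
    """Calls-alt: as Calls, plus a tuple marking the initiator itself."""
    return [(call_id, initiator, initiator)] + calls_rows(call_id, initiator, receivers)


def participants_calls(rows):
    """In Calls, the participants are spread over two columns."""
    return {caller for _, caller, _ in rows} | {callee for _, _, callee in rows}


def participants_calls_alt(rows):
    """In Calls-alt, the callee column alone lists every participant."""
    return {callee for _, _, callee in rows}


rows = calls_rows(1, "Ph1", ["Ph2", "Ph3"])
rows_alt = calls_alt_rows(1, "Ph1", ["Ph2", "Ph3"])
print(rows)
print(rows_alt)
print(participants_calls(rows) == participants_calls_alt(rows_alt))
```

Each call contributes exactly one extra tuple in Calls-alt, so the number of Calls-alt tuples equals the number of Calls tuples plus the number of calls.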
As expressed above, the background dimensions are the same as in Figure 1.
There are two slight differences, however, for practical reasons. First, for
the Time dimension, the bottom level has granularity Timestamp, since the
StartTime and EndTime attributes in the fact tables have that granularity.
That means, a new level is added to the dimension. Second, in the Phone
dimension the bottom level is the phone identifier, denoted Id, which rolls
up to the line number, denoted Number. This is because the caller and the
callee are represented as integers, as usual in real world data warehouses.
The Phone dimension is represented in a single table, keeping the constraints
indicated by the hierarchies. This representation (i.e., a star schema) was chosen
to provide a fair comparison. In summary, the dimension table schema is
Phone(Id, Number, Customer, City, Country, Operator).
Table 1: Dataset sizes for the relational representation
Dataset   tuples Calls   tuples Calls-alt   calls     tuples Phone
D1        293,817        420,517            126,700   793
D2        528,408        756,117            227,709   4,689
entity nodes, namely #Phone and #Call, to represent call facts. These are
linked through edges labelled #creator and #receiver, the former going from
the phone that initiated the call, to the node representing such call. Back-
ground dimensions are represented in the same graph, using the entity nodes
#Operator, #User, #City and #Country for the dimension levels. Finally, di-
mension levels are linked using the edges of types #provided by, #has phone,
#belongs to and #lives in. It can be observed that nodes are not duplicated.
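The call-fact portion of this graph can be sketched in a few lines of Python. The node and edge labels (#Phone, #Call, #creator, #receiver) are those named above; the node identifiers are invented, and the direction of the #receiver edges (from the call to the receiving phone) is an assumption of this sketch, since only the direction of #creator is stated here.

```python
# Node labels follow the text; node identifiers are made up for illustration.
nodes = {
    "ph1": "Phone",
    "ph2": "Phone",
    "ph3": "Phone",
    "c1": "Call",
}

# Edges are (source, label, target) triples.
edges = [
    ("ph1", "creator", "c1"),   # from the initiating phone to the call, as in the text
    ("c1", "receiver", "ph2"),  # assumed direction: from the call to a receiving phone
    ("c1", "receiver", "ph3"),
]


def phones_in_call(call_id):
    """Collect every phone attached to a call through #creator or #receiver edges."""
    creators = [s for s, label, t in edges if label == "creator" and t == call_id]
    receivers = [t for s, label, t in edges if label == "receiver" and s == call_id]
    return sorted(creators + receivers)
```

Each phone is a single node regardless of how many calls it takes part in, which is the sense in which nodes are not duplicated.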
6.3 Datasets
For the relational representation, synthetic datasets of two different sizes
are generated and loaded into a PostgreSQL database. Table 1 depicts the
sizes of the datasets. The first column shows the number of tuples in the
Calls fact table. The second column shows the number of tuples in the Calls-
alt fact table. The third column indicates the number of calls (only one
column, since the number of calls is the same in both versions), and the
fourth column tells the number of tuples in the Phone dimension table.
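Since Calls-alt adds exactly one tuple per call (the one marking the initiator), the figures in Table 1 should satisfy: tuples in Calls-alt = tuples in Calls + number of calls. The short Python check below confirms this for both datasets; the dictionary keys are names chosen for this sketch.

```python
# Figures copied from Table 1; the key names are chosen for this sketch.
TABLE_1 = {
    "D1": {"calls_tuples": 293_817, "calls_alt_tuples": 420_517,
           "calls": 126_700, "phone_tuples": 793},
    "D2": {"calls_tuples": 528_408, "calls_alt_tuples": 756_117,
           "calls": 227_709, "phone_tuples": 4_689},
}


def extra_tuples(row):
    """Tuples that Calls-alt holds beyond those in Calls."""
    return row["calls_alt_tuples"] - row["calls_tuples"]


def is_consistent(row):
    """True when Calls-alt has exactly one extra tuple per call."""
    return extra_tuples(row) == row["calls"]


consistent = {name: is_consistent(row) for name, row in TABLE_1.items()}
```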
For the graph representation, Table 2 depicts the main numbers of
elements in the Neo4j graph.
6.4 Queries
This section shows how different kinds of complex analytical queries can be
expressed and executed over the three representations described above. Four
kinds of OLAP queries are discussed: (a) Queries where the aggregations are
performed for pairs of objects (e.g., phone lines, persons, etc.); (b) Queries
where aggregations are performed in groups of N objects, where N > 2; (c)
For (a) and (b), rollups to different dimension levels are performed; (d)
Graph OLAP-style aggregations performed over graph metrics. The aim of
these experiments is to study whether, when the queries can take advantage of the
graph structure, graphoid-OLAP queries are more concisely expressed and
more efficiently executed. The impact of N on the relational and the graph
representations is also studied. The queries are described next. For the sake
of space, only some of the SQL and Cypher queries are shown.
This query computes all the N-subsets of lines that participated in some
call. That is, if a call involves 3 lines, say Ph1, Ph2 and Ph3, and N = 2,
the groups will be (Ph1, Ph2), (Ph1, Ph3), and (Ph2, Ph3). Figure ?? shows
the recursive SQL query for the first representation alternative.
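For intuition, the N-subset computation can be sketched outside the database with Python's itertools.combinations. This is not the recursive SQL query nor the Cypher query used in the experiments; the function names and the input format (a mapping from call identifier to the lines involved) are assumptions of the sketch.

```python
from itertools import combinations


def n_subsets(call_lines, n):
    """All N-subsets of the lines taking part in a single call."""
    return list(combinations(sorted(set(call_lines)), n))


def n_subsets_all_calls(calls, n):
    """Distinct N-subsets of lines that participated together in some call.

    `calls` maps a call identifier to the lines involved in that call.
    """
    result = set()
    for lines in calls.values():
        result.update(combinations(sorted(set(lines)), n))
    return sorted(result)


groups = n_subsets(["Ph1", "Ph2", "Ph3"], 2)
```

A call with k lines contributes "k choose N" groups, which gives a feel for how the amount of work changes with N.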
This query analyses a roll-up to the level Operator, which has fewer instance
members than the level User addressed in Query 2.
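The effect of such a roll-up on pairwise aggregation can be sketched as follows. The sketch assumes that the aggregate is a count of calls per pair, which is an illustrative choice and not necessarily the measure used in the experiments; the sample data and all names are made up.

```python
from collections import Counter
from itertools import combinations

# Made-up sample data: phone -> operator, and call id -> lines in the call.
OPERATOR = {"Ph1": "Claro", "Ph2": "Movistar", "Ph3": "Claro"}
CALLS = {1: ["Ph1", "Ph2", "Ph3"], 2: ["Ph1", "Ph2"]}


def rollup_pair_counts(calls, level_of):
    """Count, per unordered pair of level members, the phone pairs sharing a call."""
    counts = Counter()
    for lines in calls.values():
        for a, b in combinations(sorted(set(lines)), 2):
            # Replace each phone by its member at the target level.
            key = tuple(sorted((level_of[a], level_of[b])))
            counts[key] += 1
    return counts


operator_counts = rollup_pair_counts(CALLS, OPERATOR)
```

Because the Operator level has few members, many phone pairs collapse into the same operator pair, so the result has far fewer groups than at the phone or user level.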
Query 4 For each pair of Phones in the Calls graph, compute the shortest
path between them.
This query aims at analysing the connections between phone line users,
and has many real-world applications (for example, to investigate calls made
between two persons who use a third one as an intermediary). From a
technical point of view, this is an aggregation over the whole graph, using
as a metric the shortest path between every pair of nodes.
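To make the metric concrete, the sketch below computes shortest-path lengths between all pairs of phones with a plain breadth-first search. It assumes an undirected graph in which two phones are adjacent when they take part in the same call; this is a simplification for illustration, not the Cypher query nor the algorithm that Neo4j runs internally, and all names are hypothetical.

```python
from collections import deque
from itertools import combinations


def build_phone_graph(calls):
    """Undirected adjacency: two phones are linked if they share at least one call."""
    adj = {}
    for lines in calls.values():
        for a, b in combinations(set(lines), 2):
            adj.setdefault(a, set()).add(b)
            adj.setdefault(b, set()).add(a)
    return adj


def bfs_distances(adj, source):
    """Shortest-path length (in hops) from `source` to every reachable phone."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def all_pairs_shortest_path_lengths(adj):
    """Shortest-path length for every unordered pair of connected phones."""
    result = {}
    for s in adj:
        for t, d in bfs_distances(adj, s).items():
            if s < t:  # keep each unordered pair once
                result[(s, t)] = d
    return result


# Ph1 and Ph3 never share a call, but both talk to Ph2 (the intermediary).
adj = build_phone_graph({1: ["Ph1", "Ph2"], 2: ["Ph2", "Ph3"]})
lengths = all_pairs_shortest_path_lengths(adj)
```

A path of length 2 between Ph1 and Ph3 corresponds to the intermediary scenario mentioned above. Running one search per node over the whole graph also hints at why this query is the most expensive of the set.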
Finally, the following queries combine the computation of graph metrics
together with roll-up and dice operations.
Query 5 Compute the shortest path between pairs (p1, p2) of phone lines,
such that p1 corresponds to operator “Claro” and p2 corresponds to operator
“Movistar”.
Query 6 Compute the shortest path between pairs (p1, p2) of phone lines,
such that p1 corresponds to a user from the city of Buenos Aires and p2
corresponds to a user from the city of Salta.
Query 7 Compute the shortest path between pairs (p1, p2) of phone lines,
such that p1 corresponds to a user from the city of Buenos Aires.
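Queries 5 to 7 combine a dice on dimension attributes with the shortest-path metric. A minimal Python sketch of this combination is given below, assuming an undirected phone graph and made-up attribute values; the function names are hypothetical, and breadth-first search stands in for whatever shortest-path routine the graph database provides.

```python
from collections import deque

# Made-up sample data: adjacency between phones and their dimension attributes.
ADJ = {"Ph1": {"Ph2"}, "Ph2": {"Ph1", "Ph3"}, "Ph3": {"Ph2"}}
ATTRS = {
    "Ph1": {"Operator": "Claro", "City": "Buenos Aires"},
    "Ph2": {"Operator": "Claro", "City": "Salta"},
    "Ph3": {"Operator": "Movistar", "City": "Salta"},
}


def bfs_distances(adj, source):
    """Shortest-path length (in hops) from `source` to every reachable phone."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def matches(row, condition):
    """True when the row satisfies every attribute = value pair in the condition."""
    return all(row.get(k) == v for k, v in condition.items())


def diced_shortest_paths(adj, attrs, source_cond, target_cond=None):
    """Shortest-path lengths for pairs (p1, p2), with p1 and p2 restricted by a dice."""
    result = {}
    for s in adj:
        if not matches(attrs[s], source_cond):
            continue  # dice on p1: skip the search entirely for excluded sources
        for t, d in bfs_distances(adj, s).items():
            if t != s and (target_cond is None or matches(attrs[t], target_cond)):
                result[(s, t)] = d
    return result


q5 = diced_shortest_paths(ADJ, ATTRS, {"Operator": "Claro"}, {"Operator": "Movistar"})
q6 = diced_shortest_paths(ADJ, ATTRS, {"City": "Buenos Aires"}, {"City": "Salta"})
q7 = diced_shortest_paths(ADJ, ATTRS, {"City": "Buenos Aires"})
```

In this sketch the dice on p1 limits the number of searches that are started, while the dice on p2 only filters their results.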
6.5 Results
Table 3 shows the results of the experiments. The tests were run on a machine
with an i7-6700 processor, 12 GB of RAM, and a 250 GB disk (actually, a
virtual node in a cluster). The execution times reported are the
averages of five runs of each experiment, expressed in seconds. The winning
alternatives are marked in boldface, for clarity.
Dataset  Calls                Calls-alt            Neo4j
         N=2    N=3    N=4    N=2    N=3    N=4    N=2    N=3    N=4
D1-Q1    4.9    7.6    9.5    5.4    8.7    10.6   7.3    11.2   12.5
D1-Q2    4.6    11.7   12.9   4.4    12.3   14.5   7      11.7   14.8
D1-Q3    6.6    7.3    11.5   12.8   12.6   14.7   3.7    10.8   15.5
D1-Q4    ∞      N/A    N/A    ∞      N/A    N/A    185    N/A    N/A
D1-Q5    ∞      N/A    N/A    ∞      N/A    N/A    21     N/A    N/A
D1-Q6    ∞      N/A    N/A    ∞      N/A    N/A    6      N/A    N/A
D1-Q7    ∞      N/A    N/A    ∞      N/A    N/A    34     N/A    N/A
D2-Q1    9.3    14.1   15.1   10.4   16.2   17.7   15.6   17.5   21.6
D2-Q2    12.9   19     20.7   14.5   24     26.8   20.2   21.6   24.8
D2-Q3    12.5   19.4   22.2   14.3   14.6   22.8   9.3    18.7   28.4
D2-Q4    ∞      N/A    N/A    ∞      N/A    N/A    ∞      N/A    N/A
D2-Q5    ∞      N/A    N/A    ∞      N/A    N/A    677    N/A    N/A
D2-Q6    ∞      N/A    N/A    ∞      N/A    N/A    123    N/A    N/A
D2-Q7    ∞      N/A    N/A    ∞      N/A    N/A    924    N/A    N/A
sonable time for the largest of the two datasets (D2) but performance is
acceptable for D1. On the other hand, the relational alternatives do not
terminate successfully for either D1 or D2. It is important to make
clear that, with an ad-hoc relational design tailored to graph
representation, the performance of the relational alternative
for shortest-path aggregations could possibly be improved, although it would hardly
come close to the graph alternative, given the results presented here. However,
the intention of this paper is to present a flexible model that can perform
efficiently in a variety of situations. In this sense, the tests presented here
suggest that the graphoid data model can be competitive with the relational
model for classic OLAP queries, but is much better for typical Graph OLAP
ones.
Acknowledgments
Alejandro Vaisman was supported by a travel grant from Hasselt University
(Korte verblijven–inkomende mobiliteit, BOF16KV09). He was also par-
tially supported by PICT-2014 Project 0787 and PICT-2017 Project 1054.
The authors also thank T. Colloca, S. Ocamica, J. Perez Bodean, and N.
Castaño, for their collaboration in the data preparation for the experiments.
References
[1] R. Angles. A Comparison of Current Graph Database Models. In
Proceedings of ICDE Workshops, pages 171–177, Arlington, VA, USA,
2012.
[3] C. Chen, X. Yan, F. Zhu, J. Han, and P. Yu. Graph OLAP: a multi-
dimensional framework for graph data analysis. Knowl. Inf. Syst.,
21(1):41–63, 2009.
[5] Alfredo Cuzzocrea, Ladjel Bellatreche, and Il-Yeol Song. Data Ware-
housing and OLAP over Big Data: Current Challenges and Future Re-
search Directions. In Proceedings of DOLAP, pages 67–70, New York,
NY, USA, 2013. ACM.
[9] Ralph Kimball. The Data Warehouse Toolkit. J. Wiley and Sons, 1996.
32
[10] M. B. Kraiem, J. Feki, K. Khrouf, F. Ravat, and O. Teste. Modeling
and OLAPing social media: the case of twitter. Social Netw. Analys.
Mining, 5(1):47:1–47:15, 2015.
[11] Bart Kuijpers and Alejandro A. Vaisman. An algebra for OLAP. In-
telligent Data Analysis, 21(5), 2017.
[14] Bo Tang, Shi Han, Man Lung Yiu, Rui Ding, and Dongmei Zhang.
Extracting top-k insights from multi-dimensional data. In Proceedings
of ACM SIGMOD, Chicago, IL, USA, May 14-19, 2017, pages 1509–
1524, 2017.
[18] Peixiang Zhao, Xiaolei Li, Dong Xin, and Jiawei Han. Graph Cube: on
warehousing and OLAP multidimensional networks. In Proceedings of
ACM SIGMOD, pages 853–864. ACM, 2011.
33