0% found this document useful (0 votes)
8 views34 pages

Online Analytical Processsing On Graph Data

Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
8 views34 pages

Online Analytical Processsing On Graph Data

Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 34

See discussions, stats, and author profiles for this publication at: https://www.researchgate.

net/publication/335600067

Online Analytical Processsing on Graph Data

Preprint · September 2019

CITATIONS READS
0 61

3 authors, including:

Bart Kuijpers Alejandro Vaisman


Hasselt University Instituto Tecnológico de Buenos Aires
149 PUBLICATIONS 2,159 CITATIONS 130 PUBLICATIONS 2,511 CITATIONS

SEE PROFILE SEE PROFILE

Some of the authors of this publication are also working on these related projects:

Design and Implementation of Tools for Managent and Analysis of the Evolution of Large Volumes of Complex Data Using Graph Databases View project

Management of ETL workflows in a Big Data Environment View project

All content following this page was uploaded by Alejandro Vaisman on 19 October 2020.

The user has requested enhancement of the downloaded file.


Online Analytical Processsing on Graph Data∗
Leticia Gómez1 , Bart Kuijpers2 , Alejandro Vaisman3
arXiv:1909.01216v1 [cs.DB] 3 Sep 2019

Abstract
Online Analytical Processing (OLAP) comprises tools and algo-
rithms that allow querying multidimensional databases. It is based on
the multidimensional model, where data can be seen as a cube such
that each cell contains one or more measures that can be aggregated
along dimensions. In a “Big Data” scenario, traditional data warehous-
ing and OLAP operations are clearly not sufficient to address current
data analysis requirements, for example, social network analysis. Fur-
thermore, OLAP operations and models can expand the possibilities of
graph analysis beyond the traditional graph-based computation. Nev-
ertheless, there is not much work on the problem of taking OLAP
analysis to the graph data model.
This paper proposes a formal multidimensional model for graph
analysis, that considers the basic graph data, and also background in-
formation in the form of dimension hierarchies. The graphs in this
model are node- and edge-labelled directed multi-hypergraphs, called
graphoids, which can be defined at several different levels of granularity
using the dimensions associated with them. Operations analogous to
the ones used in typical OLAP over cubes are defined over graphoids.
The paper presents a formal definition of the graphoid model for OLAP,
proves that the typical OLAP operations on cubes can be expressed
over the graphoid model, and shows that the classic data cube model is
a particular case of the graphoid data model. Finally, a case study sup-
ports the claim that, for many kinds of OLAP-like analysis on graphs,
the graphoid model works better than the typical relational OLAP
alternative, and for the classic OLAP queries, it remains competitive.

Keywords: OLAP, Data Warehousing, Graph Database, Big Data, Graph


Aggregation

This is a draft version of the work that will appear in Volume 24(2) of the Intelligent
Data Analysis Journal, in early 2020.
1
Instituto Tecnológico de Buenos Aires, Buenos Aires, Argentina; email: lgo-
[email protected]
2
Hasselt University, Belgium; email: [email protected]
3
Instituto Tecnológico de Buenos Aires, Buenos Aires, Argentina; email: avais-
[email protected] (Corresponding author).

1
1 Introduction
Online Analytical Processing(OLAP) [9, 15] comprises tools and algorithms
that allow querying multidimensional (MD) databases. In these databases,
data are modelled as data cubes, where each cell contains one or more mea-
sures of interest, that quantify facts. Measure values can be aggregated
along dimensions, organized as sets of hierarchies. Traditional OLAP op-
erations are used to manipulate the data cube, for example: aggregation
and disaggregation of measure data along the dimensions; selection of a
portion of the cube; or projection of the data cube over a subset of its di-
mensions. The cube is computed after a process called ETL, an acronym for
Extract, Transform, and Load, which requires a complex and expensive load
of work to carry data from the sources to the MD database, typically a data
warehouse (DW). Although OLAP has been used for social network analy-
sis [10, 12], in a “Big Data” scenario, further requirements appear [5]. In the
classic paper by Cohen et al. [4], the so-called MAD skills (standing from
Magnetic, Agile and Deep) required for data analytics are described. In this
scenario, more complex analysis tools are required, that go beyond classic
OLAP [14]. Graphs, and, particularly, property graphs [8, 13], are becoming
increasingly popular to model different kinds of networks (for instance, so-
cial networks, sensor networks, and the kind). Property graphs underlie the
most popular graph databases [1]. Examples of graph databases and graph
processing frameworks following this model are Neo4j1 , Janusgraph2 (pre-
viously called Titan), and GraphFrames3 . In addition to traditional graph
analytics, it is also interesting for the data scientist to have the possibility
of performing OLAP on graphs.
From the discussion above, it follows that, on the one hand, traditional
data warehousing and OLAP operations on cubes are clearly not sufficient to
address the current data analysis requirements; on the other hand, OLAP
operations and models can expand the possibilities of graph analysis be-
yond the traditional graph-based computation, like shortest-path, centrality
analysis and so on. In spite of the above, not many proposals have been pre-
sented in this sense so far. In addition, most of the existing work addresses
homogeneous graphs (that is, graphs where all nodes are of the same type),
where the measure of interest is related to the OLAP analysis on the graph
topology [3, 17, 18]. Further, existing works only address graphs with binary
relationships (see Section 2 for an in-depth discussion on these issues). How-
ever, real-world graphs are complex and often heterogeneous, where nodes
and edges can be of different types, and relating different numbers of entities.
1
http://www.neo4j.com
2
http://janusgraph.org/
3
https://graphframes.github.io/

2
This paper proposes a MD data model for graph analysis, that considers
not only the basic graph data, but background information in the form of
dimension hierarchies as well. The graphs in this model are node- and edge-
labelled directed multi-hypergraphs, called graphoids. In essence, these can
be denoted “property hypergraphs”. A graphoid can be defined at several
different levels of granularity, using the dimensions associated with them.
For this, the Climb operation is available. Over this model, operations like
the ones used in typical OLAP on cubes are defined, namely Roll-Up, Drill-
Down, Slice, and Dice, as well as other operations for graphoid manipulation,
e.g., n-delete (which deletes nodes). The hypergraph model allows a natu-
ral representation of facts with different dimensions, since hyperedges can
connect a variable number of nodes of different types. A typical example is
the analysis of phone calls, the running example that will be used through-
out this paper. Here, not only point-to-point calls between two partners
can be represented, but also “group calls” between any number of partici-
pants. In classic OLAP [9], a group call must be represented by means of a
fact table containing a fixed number of columns (e.g., caller, callee, and the
corresponding measures). Therefore, when the OLAP analysis for telecom-
munication information concerns point-to-point calls between two partners,
the relational representation (denoted ROLAP) works fine, but when this
is not the case, modelling and querying issues appear, which calls for a
more natural representation, closer to the original data format. And here is
where the hypergraph model comes to the rescue [6]. In summary, the main
contributions of the paper are:

1. A graph data model based on the notion of graphoids;

2. The definition of a collection of OLAP operations over these graphoids;

3. A proof that the classical OLAP operations on cubes can be simulated


by the OLAP operations defined in the graphoid model and, therefore,
that these graphoid-based operations are at least as powerful as the
classical OLAP operations on cubes;

4. A case study and a series of experiments, that give the intuition of


a class of problems where the graphoid model works clearly better
than relational OLAP, whereas for classic OLAP queries, the graph
representation is still competitive with the relational alternative.

In addition to the above, of course all the classic analysis tools from
graph theory are supported by the model, although this topic is beyond the
scope of this paper.

3
Remark 1 This paper does not claim that the graphoid model is always
more appropriate than the classic relational OLAP representation. Instead,
the proposal aims at showing that when a more flexible model is needed,
where n-ary relationships between instances are present (and n is variable),
the model allows not only for a more natural representation, but also can
deliver better performance for some critical queries. t
u

The remainder of this paper is organized as follows: Section 2 discusses


related work. Section 3 presents the graphoid data model. Section 4 presents
the OLAP operations on graphoids, while Section 5 shows that the graphoid
OLAP operations capture the classic OLAP operations on cubes. Section 6
discusses a case study and presents an experimental analysis. Section 7
concludes the paper.

2 Related Work
The model described in the next sections is based on the notion of property
graphs [2]. In this model, nodes and edges (hyperdeges, as will be explained
later) are labelled with a sequence of attribute-value pairs. It will be as-
sumed that the values of the attributes represent members of dimension
levels (i.e., each attribute value is an element in the domain of a dimen-
sion level), and thus nodes and edges can be aggregated, provided that an
attribute hierarchy is defined over those dimensions. Property graphs are
the usual choice in modern graph database models used in practical imple-
mentations. Attributes are included in nodes and edges mainly aimed at
improving the speed of retrieval of the data directly related to a given node.
Here, these attributes are also used to perform OLAP operations.
A key difference between existing works, and the proposal introduced
in this paper, is that the latter supports the notion of OLAP hypergraphs,
highly expanding the possibilities of analysis. This way, instead of binary
relationships between nodes, there are n-ary, probably duplicated relation-
ships, which are typical in Data Warehousing and OLAP. Further, support-
ing n-ary relationships allows naturally modelling OLAP situations where
different facts have a different number of relations, like in the group calls
case commented in Section 1, and studied in Section 6. In other words, the
model handles multi-hypergraphs. Also, the paper works over the classic
OLAP operations, and formally defines their meaning in a graph context.
This approach allows an OLAP user to work with the notion of a data cube
at the conceptual level [15], regardless the kind of underlying data (in this
case, graphs), defining OLAP operations in terms of cubes and dimensions
rather than in terms of nodes and edges. Finally, the authors have shown the

4
usefulness of this proposal in different scenarios, like trajectory analysis [7]
and typical OLAP analysis on social networks [16].

3 Data Model
This section presents the graphoid OLAP data model. First, background di-
mensions are formally defined, along the lines of the classic OLAP literature.
Then, the (hyper)graph data model is introduced.

3.1 Hierarchies and Dimensions


The notions of dimension schema and dimension graph (or dimension in-
stance) that will be used throughout the paper, are introduced first. These
concepts are needed to make the paper self-contained, and to understand
the examples. The reader is referred to [11] for full details of the underlying
OLAP data model.

Definition 1 (Dimension Schema, Hierarchy and Level) Let D be a


name for a dimension. A dimension schema σ(D) for D is a lattice (a partial
order), with a unique top-node, called All (which has only incoming edges)
and a unique bottom-node, called Bottom (which has only outgoing edges),
such that all maximal-length paths in the graph go from Bottom to All. Any
path from Bottom to All in a dimension schema σ(D) is called a hierarchy
of σ(D). Each node in a hierarchy (that is, in a dimension schema) is called
a level of σ(D). t
u

The running example used throughout this paper analyses calls between
customers, which belong to different companies. For this, as background
(contextual) information for the graph data representing calls (to be ex-
plained later), there is a Phone dimension, with levels Phone (representing
the phone number), Customer, City, Country, and Operator. There is also a
Time dimension, with levels Date, Month, and Year. The following examples
explain this in detail.

Example 1 Figure 1 depicts the dimension schemas σ(P hone) and σ(T ime),
for the dimensions Phone and Time, respectively. In addition, there is also a
dimension denoted Id, representing identifiers, that will be explained later.
In the dimension Phone, it holds that Bottom = Phone, and there are two
hierarchies denoted, respectively, as

Phone → Customer → City → Country → All,

5
and
Phone → Operator → All.
The node Customer is an example of a level in the first of the above hi-
erarchies. For the dimension Time, Bottom = Day holds, as well as the
hierarchy Day → Month → Year → All. t
u

All All All

Country

Y ear
City Operator

M onth
Customer

Day P hone Bottom

(a) (b) (c)

Figure 1: Dimension schemas for the dimensions Time (a), Phone (b), and
Id (identifier) (c).

Definition 2 (Level, Hierarchy, and Dimension Instances) Let D be


a dimension with schema σ(D), and let ` be a level in σ(D). A level instance
of ` is a non-empty, finite set dom(D.`). If ` = All, then dom(D.All) is the
singleton {all}. If ` = Bottom, then dom(D.Bottom) is the domain of the
dimension D, that is, dom(D).
A dimension graph (or instance) I(σ(D)) over the dimension schema
σ(D) is a directed acyclic graph with node set
[
dom(D.`),
`

where the union is taken over all levels in σ(D). The edge set of this directed
acyclic graph is defined as follows. Let ` and `0 be two levels of σ(D), and
let a ∈ dom(D.`) and a0 ∈ dom(D.`0 ). Then, only if there is a directed edge
from ` to `0 in σ(D), there can be a directed edge in I(σ(D)) from a to a0 .

6
If H is a hierarchy in σ(D), then the hierarchy instance (relative to the
dimension instance I(σ(D))) is the subgraph of I(σ(D)) with nodes from
dom(D.`), for ` appearing in H. This subgraph is denoted IH (σ(D)). t
u

Remark 2 A hierarchy instance IH (σ(D)) is always a (directed) tree, since


a hierarchy is a linear lattice. The following terminology is used. If a and
b are two nodes in a hierarchy instance IH (σ(D)), such that (a, b) is in
the transitive closure of the edge relation of IH (σ(D)), then it is said that
a rolls-up to b, and denoted by ρH (a, b) (or ρ(a, b) if H is clear from the
context). Example 2 illustrates these concepts. t
u

Example 2 Consider dimension Phone whose schema σ(P hone) is given


in Figure 1 (b). Associated with this schema, there is an instance where
dom(P hone) = dom(P hone.Bottom) = dom(P hone.P hone) = {P h1 , P h2 ,
P h3 , P h4 , P h5 }. Also, at the Operator level, dom(P hone.Operator) =
{AT T, M ovistar, V odaf one}. This dimension instance I(σ(P hone)) is
depicted in Figure 2, which shows, e.g., that phone lines P h2 and P h4 cor-
respond to the operator V odaf one. t
u

all

US Italy

NYC Rome

Cust1 Cust2 Cust3


ATT Vodafone Movistar

Ph1 Ph2 Ph3 Ph4 Ph5

Figure 2: An example of a dimension instance I(σ(P hone)) for the dimen-


sion P hone.

In what follows, “sound” dimension graphs are assumed. In thses graphs,


rolling-up from the Bottom level, to the same element along different paths,
gives the same result [11], typical in so-called balanced (or homogeneous)
dimensions [15].

7
3.2 The Base Graph and Graphoids
As a basic data structure for modelling OLAP on graph data, the concept
of graphoid is introduced and defined in this section. A graphoid plays
the role of a multi-dimensional cuboid in classical OLAP and it is designed
to represent the information of the application domain, at a certain level
of granularity. Essentially, a graphoid is a node- and edge-labelled directed
multi-hypergraph.
In what follows, a collection of dimensions D1 , ..., Dd is assumed in the
application domain, and their schemas σ(D1 ), ..., σ(Dd ) are given. Further-
more, hierarchy instances I(σ(D1 )), ..., I(σ(Dd )) for all dimensions are given.
Finally, assume that a special dimension D0 = Id is given, to represent
unique identifiers (Figure 1(c)). The notions of attributes, node types and
edge types are defined next.

Attributes The set of attributes A that describe the data is defined as


A = {D.` | D ∈ {D0 , D1 , ..., Dd } and ` is a level of D}. As described in
Section 3.1, to each attribute A of A, a domain dom(A) is associated, from
which the attribute takes values.

Node types Assume a finite, non-empty set N of node types. Elements


of N are denoted by a string starting with a hashtag. For example, the
node type #Phone indicates that a node in a graph represents a phone line
number. There are also two functions, ar and dim defined on N . For each
node type #n in N , ar(#n) is a natural number, called the arity, that
expresses the number of attributes associated with a node of type #n. Also,
dim(#n) is an ar(#n)-tuple of attributes, which are dimensions defined
at the Bottom level, the first of which is the Identifier dimension. This
means that dim(#n) is an element of {Id} × {D1 , ..., Dd }ar(#n)−1 . The
tuple dim(#n) tells which attributes are associated with a node of type #n,
without specifying their levels. Finally, assume that dim(#n) contains no
repetition, which is the usual case in practice. The identifier dimension is
always used at its Bottom level.

Edge types Assume the existence of a finite, non-empty set E of edge


types, which is disjoint from the set N . Elements of E will also be denoted
by a string starting with a hashtag. For example, the node type #Call
indicates that an edge connects nodes that participate in a call. Again, also
assume the existence of the functions ar and dim on E. To each edge type
#e in N , ar(#e) is a natural number, called the arity, that expresses the
number of attributes associated with an edge of type #e. Also, dim(#e)
is an ar(#e)-tuple of attributes, which are dimensions (at Bottom-level).

8
This means that dim(#e) is an element of {D0 , D1 , ..., Dd }ar(#e) . The tuple
dim(#n) expresses which attributes are associated with an edge of type #e,
without specifying their levels. Finally, assume that dim(#e) contains no
repetition. The identifier dimension (at its Bottom level) may appear, but
is not required. If the identifier dimension appears, this only occurs once,
among the attributes that describe edges of a certain type.
It is now possible to define the notion of graphoid.

Definition 3 (Graphoid) Let D0 = Id be the identifier dimension. Let


dimensions D1 , ..., Dd be given with their respective schemas and instances.
Let `1 , ..., `d be levels for these respective dimensions. A (D1 .`1 , ..., Dd .`d )-
graphoid (or graphoid, for short, if the levels are clear from the context) is
a 6-tuple G = (N, τN , λN , E, τE , λE ), where
• N is a finite, non-empty set, called the set of nodes of G;
• τN is a function from N to N (that associates a unique type with each
node of G);
• λN is a function that maps a node n ∈ N to a string [#n, a1 , ..., aar(#n) ],
where #n = τN (n) and, if dim(#n) = (A1 , ..., Aar(#n) ), then, for
i = 1, ..., ar(#n), ai ∈ dom(Dj .`j ), if Ai is the dimension Dj . It is
assumed that different a1 -values are associated with different nodes,
since the first attribute value acts as a node identifier; λN is denoted
the node labelling function;
• E is a subbag4 of the set P(N ) × P(N ), which we call the set of (multi
hyper-)edges of G;
• τE is a function from E to E (that associates a unique type to each
edge of G); and
• λE is a function that maps a hyperedge e ∈ E to a string [#e, b1 , ...,
bar(#n) ], where #e = τE (e) and, if dim(#e) = (B1 , ..., Bar(#n) ), then,
for i = 1, ..., ar(#e), bi ∈ dom(Dj .`j ), if Bi is the dimension Dj ; λE is
called the edge labelling function. t
u

The basic graph data that serves as input data to the graph OLAP
process, is called the base graph. A base graph plays the role of a multi-
dimensional cube in classical OLAP and is designed to contain all the infor-
mation of the application domain, at the lowest level of granularity.
4
Let A and B be bags (or sets). If the number of occurrences of each element a in A
is less than or equal to the number of occurrences of a in B, then A is called a subbag of
B, also denoted A ⊆ B.

9
Definition 4 (Base graph) Let dimensions D1 , ..., Dd be given with their
respective schemas and instances. The (D1 .Bottom, ..., Dd .Bottom)-graphoid
is called the base graph. t
u

Example 3 The running example used in this paper is aimed at analysing


calls between customers of phone lines; lines correspond to different opera-
tors. Examples 1 and 2 showed some of the dimensions used as background
information. Next, the call information is shown, represented as a graph.
The Phone dimension plays the roles of the calling line and the callee lines
(this is called a role-playing dimension in the OLAP literature [15]). The
information in the hyperedges reflects the total duration of the calls between
two or more phone numbers on a given day. Figure 3 shows an example of a
base graph, where N = {1, 2, 3, 4, 5} is the node set. The nodes in this base
graph are all of the same type and represent phones (not persons–a person
may have more than one phone). In this example, N = {#Phone}. The
node type #Phone has arity 2. Its first attribute is a node identifier and
the second one is a dimensional attribute that represents the phone number,
with domain {Ph1 , Ph2 , ...}. In the example of Figure 3,

λN : i 7→ [#Phone, 10 + i, Phi ], for i = 1, ..., 5.

Hyperedges represent phone calls, which most of the time involve two
phones, but which may also involve multiple phones, representing so-called
“group calls.” So, edges are all of the same type #Call and E = {#Call}.
In Figure 3, a directed hyperedge from a subset S of N to a subset T of
N is graphically represented by a coloured node which has incoming arrows
(of the same colour) from all elements of S and outgoing arrows (again of
the same colour) to all elements of T . Such a coloured construction is a
depiction of the hyperedge e = (S, T ), which will be denoted S → T from
now on.5 For example, the red and purple hyperedges {1} → {2} represent
two different phone calls from Ph1 to Ph2 , made on the same day and of
the same duration. This example explains why the model assumes bags
rather than sets. The orange hyperedge {3} → {2, 5} represents a group
call, from Ph3 to both Ph2 and Ph5 . There are six phone calls shown in
the figure. So, E is the bag {{{1} → {2}, {1} → {2}, {4} → {3}, {4} →
{5}, {3} → {2, 5}, {5} → {2, 3}}}. The edge labelling function λE associates
two attributes, with edges of type #Call, namely Date and Duration. Date
is a dimensional attribute to which the dimensional hierarchy in Figure 1
5
The nodes of S are called the source nodes of e and the nodes of T are called the
target nodes of e. The source and target nodes of e are called adjacent to e, and the set
of the adjacent nodes to e is denoted by Adj(e). Thus, Adj(e) = S ∪ T .

10
is associated. Duration is a measure attribute (which has as an associated
aggregation function, in this case, the summation).

[#Call, 10/10/2016, 4]
[#Phone, 13, Ph3 ]
[#Phone, 11, Ph1 ] [#Phone, 12, Ph2 ]
3
1 2

[#Call, 10/10/2016, 4]
[#Call, 11/10/2016, 6]
[#Call, 5/5/2016, 8]
[#Call, 10/10/2016, 3]
[#Phone, 14, Ph4 ]
4 5 [#Phone, 15, Ph5 ]
[#Call, 2/5/2016, 5]

Figure 3: Basic phone call data as a base graph.

t
u

Note that, although the base graph plays the role of a multi-dimensional
cube in classical OLAP (or a fact table in relational OLAP), a key difference
is that this cube has a variable number of “axes”, since it can represent facts
including a variable number of dimensions. The next example discusses two
graphoids whose dimensions are at different levels of granularity. Later it
will be explained how these graphoids can be obtained from the base one.

Example 4 Continuing with Example 3, consider two available dimensions,


namely D1 = Time and D2 = Phone. A (Time.Day, Phone.Operator)-
graphoid based on the base graph of Figure 3, is shown in Figure 4. Here,
in the Phone nodes, the phone numbers have been replaced with their cor-
responding operator name, at the Phone.Operator level in the dimension
Phone (e.g., for Ph3 , the corresponding operator is Movistar).
Figure 5 shows an alternative (Time.Day, Phone.Operator)-graphoid for
the data from Figure 3. This graphoid has N = {1, 2, 3} as a node set. The
nodes with identifiers 12 and 14 represent, respectively, Ph2 and Ph4 in the
base graph (and also in the graphoid of Figure 4), which belong to the opera-
tor Vodafone. Thus, these two nodes were collapsed into one (with identifier
12) and similarly, the nodes Ph3 and Ph5 were collapsed into one node (with
identifier 13). These operations were possible because these nodes have iden-
tical attribute values (apart from the identifier). For the dimension Time,

11
[#Call, 10/10/2016, 4]
[#Phone, 13, Movistar]
[#Phone, 11, ATT] [#Phone, 12, Vodafone]
3
1 2

[#Call, 10/10/2016, 4]
[#Call, 11/10/2016, 6]
[#Call, 5/5/2016, 8]
[#Call, 10/10/2016, 3]
[#Phone, 14, Vodafone]
4 5 [#Phone, 15, Movistar]
[#Call, 2/5/2016, 5]

Figure 4: A (Time.Day, Phone.Operator)-graphoid, based on the data shown


in Figure 3.

all information in Figure 5 is at the level of Day and all information for the
dimension Phone is at the level of Company. These examples show that
there can be more than one (Time.Day, Phone.Operator)-graphoids “con-
sistent” with the given base graph. Thus, some kind of normalization is
needed. This is studied in the next section. t
u

Remark 3 Nodes are assumed to represent basic objects in the modelled


application world. These objects are given by a number of descriptive at-
tributes. Measure information, typically present in an OLAP setting to
quantify facts, is, in this philosophy, represented as attributes on the hy-
peredges. The call duration is an example of a measure that is placed on
edges of the type Call. However, the above definition also allows for node
attributes to be dimensions that contain measure information. Consider a
slightly modified situation in which an object of type #Phone includes an
additional attribute that expresses the average (or expected) billing amount
for that particular phone number, for example, [#Phone, 11, Ph1 , 880]. In
this modified setting, a user may want to compute the average expected
billing amount over all phone lines. To answer these kinds of queries, at-
tribute values of certain types of nodes must be averaged (in the example, the
#HasExpectedBill attribute). However, in the model presented here, aggre-
gations are only performed on attribute values of hyperedges. Whenever this
problem occurs, the representation can be modified as illustrated in Figure 6.
On the left-hand side, there is a node that includes the #HasExpectedBill

12
[#Call, 5/5/2016, 8]

[#Call, 2/5/2016, 5]
[#Call, 10/10/2016, 4] [#Phone, 13, Movistar]
[#Phone, 11, ATT]
1 2 3

[#Phone, 12, Vodafone]


[#Call, 10/10/2016, 4]
[#Call, 10/10/2016, 3]

[#Call, 11/10/2016, 6]

Figure 5: An alternative (Time.Day, Phone.Operator)-graphoid, based on


the data shown in Figure 3.

attribute. On the right-hand side, this attribute is brought to the All level
in its dimension and gets the value all. The expected billing information is
moved to a new edge of type #HasExpectedBill, where it can be subject to
aggregation. The above operation is called the edgification of an attribute
A in a node of type #n, and it is denoted by Edgify(#n, A). t
u

[#HasExpectedBill, 880]

[#Phone, 11, Ph1 , 880]


1 1
[#Phone, 11, Ph1 , all]

(a) (b)

Figure 6: (a) A node with label [#Phone, 11, Ph1 , 880], where 880 expresses
the expected bill. (b) An edgification of this node, where the expected billing
information is moved to an edge that is labelled #HasExpectedBill.

13
3.3 Minimal graphoids
In this section, the notion of minimal (D1 .`1 , ..., Dd .`d )-graphoid is de-
fined. This graphoid is obtained collapsing the nodes that have identi-
cal labels (apart from the identifier) in the original graphoid. Let G =
(N, τN , λN , E, τE , λE ) be a (D1 .`1 , ..., Dd .`d )-graphoid. If the nodes n1 , n2 ∈
N have identical labels, apart from the identifier, denoted λN (n1 ) =Id λN (n2 ),
then these nodes are identified, such that only the one with the smallest
identifier is preserved, while the others are deleted. So, if the λN -values of
the nodes n1 , n2 , ..., nk pairwise satisfy the =Id -relationship, and n1 has the
smallest identifier among them, then the nodes n2 , ..., nk are replaced by n1
and then deleted. The expression repN (ni ) = n1 , for i = 1, 2, ..., k, indicates
that n1 represents the nodes n1 , n2 , ..., nk in the minimal graph. All edges
leaving from or arriving at the nodes n2 , ..., nk are redirected to n1 . For
this purpose, the function repN is defined on subsets of the node set N : if
S ⊆ N , then repN (S) = {repN (n) | n ∈ S}. Now, the notion of minimal
graphoid is defined more formally.

Definition 5 (Minimal graphoid) Let D0 , D1 , ..., Dd and `1 , ..., `d be the


same as in Definition 3. Let G = (N, τN , λN , E, τE , λE ) be a (D1 .`1 , ..., Dd .`d )-
graphoid. The minimal graphoid of G is the (D1 .`1 , ..., Dd .`d )-graphoid
G0 = (N 0 , τN 0 , λN 0 , E 0 , τE 0 , λE 0 ), defined as follows:

• N 0 is the set repN (N ) = {repN (n) | n ∈ N };


• τN 0 is a function from N 0 to N , defined as τN 0 (repN (n)) := τN (repN (n)),
for each n in N ;
• λN 0 is a function on N 0 defined as λN 0 (repN (n)) := λN (repN (n)), for
each n in N ;
• E 0 is a subbag of the set P(N 0 )×P(N 0 ), defined as follows: for each hy-
peredge e = S → T in E, then a new hyperedge repN (e) := repN (S) →
repN (T ) is in E 0 ;
• τE 0 is a function from E 0 to E, defined as τE 0 (repN (e)) := τE (e), for
each e in E;
• λE 0 is a function on E 0 and it is defined as λE 0 (repN (e)) := λE (e), for
each e in E.
t
u

Remark 4 The set N of nodes of G is contracted to the set N 0 = repN (N ),


therefore each node in N 0 has the smallest identifier among all nodes that

14
are mapped to n by the repN -function. For edges, E 0 is defined as the bag
{{repN (e) | e ∈ E}}, which means that for each hyperedge in E, there is a
corresponding hyperedge in E 0 . This means that the cardinalities of the bags
E and E 0 are the same. t
u

Proposition 1 immediately follows from Definition 5.

Proposition 1 For any (D1 .`1 , ..., Dd .`d )-graphoid G = (N, τN , λN , E, τE ,


λE ), its minimal (D1 .`1 , ..., Dd .`d )-graphoid always exists and it is unique.
t
u

Example 5 The two (Time.Day, Phone.Operator)-graphoids shown in Fig-


ures 4 and 5 in Example 4, correspond to the graph of Figure 3. The graphoid
of Figure 5 is the minimal graphoid of Figure 4. In this example, the orig-
inal nodes 2 and 4 are contracted into one node, namely the node 2 (since
it has the smallest identifier of the two). Similarly, the original nodes 3 and
5 are contracted into the node 3. The original node 1 remains unchanged.
Between nodes 1 and 2, there are two edges (with the same label) in the
original graph. They are copied in the minimal graph. The edges between
nodes 4 and 3, and 4 and 5, respectively, become two edges between the
nodes 2 and 3 in the minimal graph. The two hyperedges that involve nodes
2, 3 and 5 correspond to two hyperedges between the nodes 2 and 3 in the
minimal graph. t
u

For any (D1 .`1 , ..., Dd .`d )-graphoid G = (N, τN , λN , E, τE , λE ), the re-
sult of the minimisation described in this section is denoted Minimize(G),
and called the minimisation of G.

Remark 5 It is easy to see that the minimal (D1 .`1 , ..., Dd .`d )-graphoid of
a (D1 .`1 , ..., Dd .`d )-graphoid G = (N, τN , λN , E, τE , λE ) can be computed,
in the worst case, in time that is quadratic in |N | and linear in |E|. This can
be improved, for instance, with an early pruning of the nodes that will not
be contracted. Addressing this issue is beyond the scope of this paper. t u

4 OLAP Operations on Graphs


In this section, the operations that compose the graph-OLAP language over
graphoids are defined. Section 5 will show that these operations can simulate
the typical OLAP operations on cubes.

15
4.1 Climb
The Climb-operation, intuitively, allows to define graphs at different levels
of granularity, based on the background dimensions.

Definition 6 (Climb) Assume a (D1 .`1 , ..., Dd .`d )-graphoid G is given as


follows: G = (N, τN , λN , E, τE , λE ). Let Dk be a dimension that appears
in G, and `k and `0k be levels in the schema σ(Dk ) of this dimension, such
that `k → `0k . Also, let ρ`k →`0k be the corresponding rollup function (at the
instance level). Finally, let #n be a node type that appears in G, and #e
be an edge type that appears in G.
The node-climb-operation of G along the dimension Dk from level `k
to level `0k in all nodes of type #n, denoted Climb(G, #n, Dk .(`k → `0k )),
replaces all attribute values a from dom(Dk .`k ) by the value ρ`k →`0k (a) from
dom(Dk .`0k ), in all nodes of G of type #n, leaving G unaltered otherwise.
The edge-climb-operation of G along the dimension Dk from level `k to
level `0k in all hyperedges of type #e, denoted Climb(G, #e, Dk .(`k → `0k )),
replaces all attribute values a from dom(Dk .`k ) by the value ρ`k →`0k (a) from
dom(Dk .`0k ), in all edges of G of type #e, leaving G unaltered otherwise. t u

Example 6 Applying to the graphoid G depicted in Figure 3 the opera-


tion Climb(G, #Phone, Phone.(Phone → Operator)), results in the graphoid
shown inFigure 4. t
u

Remark 6 If a dimension Dk appears in multiple node types and edge


types, to apply the Climb-operation on many of them, the shorthand ex-
pression Climb(G, {#n1 , ..., #nr , #e1 , ..., #es }, Dk .(`k → `0k )) can be used.
Finally, Climb(G, ∗, Dk .(`k → `0k )) denotes a climbing, in the dimension Dk ,
from level `k to level `0k in all possible node and edge types. t
u

4.2 Grouping
The Group-operation, both on nodes and on edges, is defined in this section.

Definition 7 (Grouping) Assume a (D1 .`1 , ..., Dd .`d )-graphoid G is given


as follows: G = (N, τN , λN , E, τE , λE ). Let Dk be a dimension that appears
in G and let `k and `0k be levels in the schema σ(Dk ) of this dimension, such
that `k → `0k . Let ρ`k →`0k be the corresponding rollup function. Let #n be a
node type that appears in G and let #e be an edge type that appears in G.
The node-grouping of G along the dimension Dk from level `k to level `0k
in all nodes of type #n, denoted Group(G, #n, Dk .(`k → `0k )), is defined as
Minimize(Climb(G, #n, Dk .(`k → `0k ))).

16
The edge-grouping of G along the dimension Dk from level `k to level `0k
in all hyperedges of type #e, denoted Group(G, #e, Dk .(`k → `0k )), is defined
as Climb(G, #n, Dk .(`k → `0k )). t
u

Example 7 Applying to the graphoid G depicted in Figure 4 the operation


Group(G, #Phone, Phone.(Phone → Operator)), results in the graphoid, de-
picted in Figure 5. t
u

4.3 Aggregate
In this section, the Aggr-operation on measures stored in edges is defined.

Definition 8 (Aggregate) Given a minimal (D1 .`1 , ..., Dd .`d )-graphoid G


defined as G = (N, τN , λN , E, τE , λE ), let Dk be a dimension that appears in
the hyperedges of G of type #e, that plays the role of a measure, to which
the aggregate function Fk can be applied. The aggregation of the graph-
oid G over the dimension Dk (using the function Fk ), denoted Aggr(G, #e,
Dk , Fk ), results in a graphoid G0 over the same N, τN and λN as G, with
the following modified hyperedge bag E 0 . If the hyperedges e1 , e2 , ..., er are
all of type #e and all of type S → T (and if they are the only ones), and
if λE agrees on all of them apart from a possible identifier-attribute, and
apart from the dimension Dk , then the hyperedges e1 , e2 , ..., er are replaced
by one of them (say e1 ) of the same type and with the same attribute values,
apart from the identifier, which is the identifier of e1 , and the value of the
attribute Dk .`k , which becomes the value of the aggregation function Fk
applied to the values of the attribute Dk .`k of the edges e1 , e2 , ..., er . t
u

Example 8 Applying the operation Aggr(G, #Call, Duration, Sum) to the


graphoid G, depicted in Figure 5, results in a graphoid where the two
edges that connect the nodes 1 and 2 are replaced by one edge with label
[#Call, 10/10/2016, 8], which contains, in the measure attribute, the sum of
the two durations. t
u

Remark 7 To aggregate multiple dimensions M1 , ..., Mk , using the aggre-


gate functions F1 , ..., Fk simultaneously, the notation would be: Aggr(G, #e,
{M1 , ..., Mk }, {F1 , ..., Fk }). Also, for simplicity, only the typical SQL aggre-
gate functions Sum, Max,Min and Count are considered. t
u

Remark 8 Although the operations Climb, Group, and Aggr, are not present
in classic relational OLAP, they are included here for several reasons: first,
they can be useful when operating on graphs in practice; second, they fa-
cilitate and make it simple the definition of the Roll-up operation, that
otherwise could be unnecessarily difficult to express. t
u

17
4.4 Roll-Up
The operations defined above allow defining the Roll-Up-operation over di-
mensions and measures stored in edges, as explained next.
Definition 9 (Roll-Up) Assume a (D1 .`1 , ..., Dd .`d )-graphoid G is given
as follows: G = (N, τN , λN , E, τE , λE ). Let Dc be a dimension that appears
in some nodes and/or hyperedges of G, that plays the role of a climbing
dimension. Let M1 , ..., Mk be dimensions that appear in the hyperedges
of type #e of G. These dimensions play the role of measure dimensions,
and it is assumed that aggregate functions F1 , ..., Fk are associated with
them. Let #n1 , ..., #nr be node types appearing in G, and let #e1 , ..., #es
be hyperedge types appearing in G. The roll-up of G over the dimensions
M1 , ..., Mk (using the functions F1 , ..., Fk ) in hyperedges of type #e, and
over the climbing dimension Dc from level `c to level `0c in nodes of types
#n1 , ..., #nr and edges of types #e1 , ..., #es , denoted
Roll-Up(G, {#n1 , ..., #nr , #e1 , ..., #es }, Dc .(`c → `0c ); #e, M1 , ..., Mk , F1 , ..., Fk ),
is defined as

Aggr(Minimize(Climb(G, {#n1 , ..., #nr , #e1 , ..., #es },


Dc .(`c → `0c ))), #e, M1 , ..., Mk , F1 , ..., Fk ).
t
u
Example 9 Applying to the graphoid depicted in Figure 5 the operation
Roll-Up(G, {#Call}, Time.(Day → Year); #Call, Duration, Sum), results in
the graphoid of Figure 7. The minimisation step in the above implementa-
tion of the roll-up operation does nothing, in this case, since the operation
is applied to a minimal graphoid. t
u

Remark 9 To apply the climbing in the roll-up operation to the nodes and
edges of all possible types, the shorthand “∗” is used as follows: Roll-Up(G,
∗, Dc .(`c → `0c ); #e, M1 , ..., Mk , F1 , ..., Fk ). To aggregate over all edge types,
the notation is Roll-Up(G, ∗, Dc .(`c → `0c ); ∗, M1 , ..., Mk , F1 , ..., Fk ). t
u

4.5 Drill-Down
The Drill-Down-operation does the opposite of Roll-Up,6 taking a graphoid
to a finer granularity level, along a dimension Dd , call it a descending di-
6
Actually, this is true for a sequence of roll-up and drill-down operations such that
there are no slicing or dicing operations (explained in Sections 4.6 and 4.7) in-between.
However, for the sake of simplicity, and without loss of generality, in this paper it is
assumed that roll-up and drill-down are the inverse of each other.

18
[#Call, 2016, 13]

[#Phone, 11, ATT] [#Phone, 12, Vodafone] [#Phone, 13, Movistar]


1 2 3

[#Call, 2016, 8]

[#Call, 2016, 9]

Figure 7: The result of the operation Roll-Up(G, {#Phone}, Time.(Day →


Year); #Call, Duration, Sum) applied to the graphoid of Figure 5.

mension, and also operating over a collection of measures, using the same
aggregate functions associated with such measures. Note also that, descend-
ing from a level `d down to a level `0d along a dimension Dd is equivalent
to climbing from the bottom level of Dd , Dd .Bottom, to the level `0d along
Dd . Thus, the drill-down of G over the dimensions M1 , ..., Mk (using the
functions F1 , ..., Fk ) in hyperedges of type #e, and over the descending di-
mension Dd from level `d to level `0d in nodes of types #n1 , ..., #nr and edges
of types #e1 , ..., #es , denoted

Drill-Down(G, {#n1 , ..., #nr , #e1 , ..., #es },


Dd .(`d → `0d ); #e, M1 , ..., Mk , F1 , ..., Fk ),

is defined as

Aggr(Minimize(Climb(G, {#n1 , ..., #nr , #e1 , ..., #es },


Dd .(Bottom → `0d ))), #e, M1 , ..., Mk , F1 , ..., Fk ).

Given the above, in what follows the discussion is limited to the Roll-Up-
operation.

4.6 Dice
The Dice-operation over a graphoid, produces a subgraphoid that satisfies a
Boolean condition ϕ over the available dimension levels. A “strong” version
is also defined, called the s-Dice-operation. In this context, ϕ is a Boolean
combination of atomic conditions of the form D.` < c, D.` = c, and D.` > c,
where D is a dimension, ` is a level in that dimension, and c ∈ dom(D.`).

19
The expression ϕ can be written in disjunctive normal form as
_^
ϕkl ,
k l

where all ϕkl are atomic conditions.


Before giving the definition of the Dice-operation, it must be explained
what does it mean that a hyperedge e in a graphoid satisfies ϕ, denoted
e |= ϕ. For this, interpreting conjunction and disjunction in the usual way,
it suffices to define e |= ϕkl for the atomic formulas that appear in ϕ. Thus,
ϕkl cannot be evaluated in e if the label of e does not contain information
on dimension D at level `. Otherwise, ϕkl can be evaluated in e. Let ϕkl be
D.` < c, D.` = c or D.` > c; ϕkl is not false in e if it can be evaluated in
e and is true, or if it cannot be evaluated in e. The notion of ϕkl being not
false in a node n adjacent to e (that is, n ∈ Adj(e)) is defined analogously.
Finally, e |= ϕkl if ϕkl is not false in e and not false in all n ∈ Adj(e).

Definition 10 (Dice) Assume a (D1 .`1 , ..., Dd .`d )-graphoid G is given as


G = (N, τN , λN , E, τE , λE ). Let ϕ be a Boolean combination of equality and
inequality constraints that involve, on the one hand dimension levels `01 , ..., `0d
(equal or higher than `1 , ..., `d in the dimension schemas σ(D1 ), ..., σ(Dd ), re-
spectively), and on the other hand, constants from dom(D1 .`01 ), ..., dom(Dd .`0d ).
The dice over G on the condition ϕ, denoted Dice(G, ϕ), produces a sub-
graphoid of G, whose nodes are the nodes of G and whose edges satisfy
the conditions expressed by ϕ. When an hyperedge does not satisfy ϕ, the
whole hyperedge is deleted from the graph and thus, it does not belong to
Dice(G, ϕ). All other edges of G belong to Dice(G, ϕ). If two edges in G
have the same set of adjacent nodes and one of them is deleted from G in
Dice(G, ϕ), then both of them are deleted in G to obtain the strong dice
over G on the condition ϕ, denoted s-Dice(G, ϕ). t
u

4.7 Slice
Intuitively, the Slice operation eliminates the references to a dimension in a
graphoid. The formal definition follows.

Definition 11 (Slice) Assume a (D1 .`1 , ..., Dd .`d )-graphoid G is given as


G = (N, τN , λN , E, τE , λE ). Let Ds be a dimension that appears in some
nodes and/or hyperedges of G. Let M1 , ..., Mk be dimensions that appear
in the hyperedges of G. These dimensions play the role of measure dimen-
sions. It is assumed that aggregate functions F1 , ..., Fk are associated with
them. The slice of the dimension Ds from G over the dimensions M1 , ..., Mk
(using the functions F1 , ..., Fk ), denoted Slice(G, Ds ; M1 , ..., Mk , F1 , ..., Fk ),

20
is defined as the roll-up operation up to the level Ds .All over the dimensions
M1 , ..., Mk (using the functions F1 , ..., Fk ). Formally, this slice operation is
defined as Roll-Up(G, ∗, Ds .(`s → All); ∗, M1 , ..., Mk , F1 , ..., Fk ). t
u

4.8 Node-delete
The n-Delete-operation over a graphoid, deletes all nodes of a certain type
and delete, in the source and target set of all edges, the nodes of this type.
Again, although this operation is not present in classic OLAP, it is needed to
simulate the classic OLAP slice operation, as will become clear in Section 5.2.

Definition 12 (Node-delete) Assume a (D1 .`1 , ..., Dd .`d )-graphoid G is


given as G = (N, τN , λN , E, τE , λE ). Given a node type #n, the node-delete
over G operation, denoted n-Delete(G, #n), produces a subgraphoid of G,
whose nodes of type #n are deleted, and such that all edges e = S → T
are replaced by edges S #n → T #n , where S #n and T #n are S and T ,
respectively, minus the nodes of type #n. The edges remain of the same
type and they keep the same label. t
u

Example 10 When a graphoid contains only nodes of one type, as in Fig-


ure 3, the result of the deletion of a node is, obviously, the empty graph. In
the graphoid of Figure 9 (explained later), the result of n-Delete(G, #Location)
would be a graph with nodes 2 and 3, where a hyperedge containing only
these nodes would remain, with label [#Sales, 10]. t
u

5 Classical OLAP Cubes as a Special Case of OLAP


Graphs
This section explains how the classical cube-based OLAP model can be
represented in the graphoid OLAP model. It is also shown that the classical
OLAP-operations Roll-Up, Drill-Down, Slice and Dice can be simulated by
the graphoid OLAP-operations defined in Section 4.

5.1 A Discussion on Modelling Cubes as Graphoids


Figure 8 illustrates a typical example of an OLAP cube with dimensions (D1 ,
D2 , D3 ) = (P roduct, Location, T ime). The cube represents sales amounts
of products at certain stores locations (cities) on certain dates (at the low-
est level of granularity). There are several ways for representing this cube
in the graphoid model. Figure 9 shows two ways of modelling the fact
(Lego, Antwerp, 1/1/2014; 10), which expresses that the sales of Lego in the
Antwerp store on January 1st, 2014 amount to 10.

21
Marseille

n
Paris

tio
Brussels

ca
Lo
Antwerp
1/1/2014

2/1/2014

Time (Day) ...

31/1/2014
Brio Oranges
Lego Apples
Product

Figure 8: An example of a Sales data cube with one measure: µ1 = sales.

Figure 9(a) shows nodes 1, 2 and 3, of types #Product, #Location and


#Time, respectively. All of them have only one attribute, to store the values
Lego, Antwerp and 1/1/2014, call those attributes ProductVal, LocationVal
and TimeVal, respectively. Further, those attributes are dimensions, with
an appropriate dimension schema. The measure information is stored in
the hyperedge ∅ → {1, 2, 3} with label [#Sales, 10], which has one attribute,
namely SalesVal, to store the sale amount (10, in this case). Thus, in this
approach, each cell of a data cube is modelled by a “star”-shaped hyperedge.
A more compact representation is shown in Figure 9(b). Here, there is
only one node, of type #Cube in the graphoid, which represents the data
cube. This node is labelled [#Cube, 11], and has no attribute values (apart
from an identifier value). Cell-coordinates and cell-content are stored in hy-
peredges that form loops around the node. The fact (Lego, Antwerp, 1/1/2014 ; 10)
is modelled by a unique hyperedge with label [#InCube, Lego, Antwerp, 1/1/2014,
10]. Thus, cube facts are represented by a hyperedge of type #InCube that
has four attributes: ProductVal, LocationVal, TimeVal and SalesVal.
In between the two alternatives above, there are, obviously, more mod-
elling possibilities. The next section will show that the graphoid OLAP-
operations presented in Section 4, are at least as powerful as the classical
OLAP-operations of the classical cube model. The proof will assume the
star-representation of data cubes in the graphoid model (Figure 9(a)).

5.2 Graph- and Classic- OLAP Operations Equivalence


A (classical) data cube C over dimensions D1 , ..., Dd with measures µ1 , ..., µm ,
can then be seen as a partial function µ : dom(D1 ) × · · · × dom(Dd ) →

22
[#Location, 11, Antwerp]
1
3 [#InCube, Lego, Antwerp, 1/1/2014, 10]
[#Product, 13, Lego]
[#Sales, 10]
1

2 [#Cube, 11]
[#Time, 12, 1/1/2014]
(a) (b)

Figure 9: Star-representation of the fact (Lego, Antwerp, 1/1/2014; 10) (a).


Petal-representation of the fact (Lego, Antwerp, 1/1/2014; 10) (b).

dom(µ1 ) × · · · × dom(µm ). This function maps each “cell” of the cube to m


values for the measures. A cell of the cube with coordinates (a1 , ..., ad ) ∈
dom(D1 )×· · ·×dom(Dd ), that contains values (c1 , ..., cm ) ∈ dom(µ1 )×· · ·×
dom(µm ), is denoted by (a1 , ..., ad ; c1 , ..., cm ). Below, the “star-representation”
of a data cube in the graphoid model is formally defined.

Definition 13 (Star-graphoid) Let C be a data cube over dimensions


D1 , ..., Dd , with measures µ1 , ..., µm . The star-graphoid of C, denoted Star(C),
is defined as follows.

• For i = 1, ..., d, for each ai ∈ dom(Di ), there is a node of type #Di


with label [#Di , id, ai ], where id is a unique node identifier.

• For each cell (a1 , ..., ad ; c1 , ..., cm ) ∈ dom(D1 ) × · · · × dom(Dd ) →


dom(µ1 )×· · ·×dom(µm ), there are m hyperedges: for each j = 1, ..., m,
there is a hyperedge of type #µj with an empty source node set and
with a target node set consisting of all nodes labelled [#Di , id, ai ], for
i = 1, ..., d, which is labelled [#µj , cj ]. t
u

Now, the main theorem of this section is stated.

Theorem 1 The cube OLAP-operations Roll-Up, Drill-Down, Slice and Dice


can be expressed (or simulated) by OLAP-operations on graphoids.

Proof 1 Let C be a data cube, and let Star(C) be its star-graphoid. The
proof is based on showing that each of the classical OLAP operations Roll-Up,
Drill-Down, Slice and Dice, over C, can be equivalently applied on Star(C).
The semantics for the classical OLAP operations is the one given in [11].

23
Roll-Up. For cube data, a roll-up operation takes as input a data cube
C, a dimension Dc and a level `i in σ(Dc ) and returns the aggregation
of the original cube along Dc up to level `c for all of the input measures
µ1 , ..., µm , using aggregate functions F1 , ..., Fm . Assume, without loss of
generality, that the roll-up starts at the Bottom level, that is, at dom(Dc ).
Also assume, for the sake of clarity of exposition, that m = 1, that is,
that there is only one measure, call it µ, with associated aggregate func-
tion F . Now, it will be shown that the roll-up Roll-Up(C, Dc .`c ; µ, F ) on
the cube C can be simulated on Star(C) by the graphoid OLAP-operation
Roll-Up(Star(C), {#Dc }, Dc .(Bottom → `c ); #eµ ; µ, F ), where #Dc is the
unique node type in Star(C) that contains information on Dc and where
#eµ is the unique edge type that contains measure information on µ.
Let (a1 , ..., ac−1 , ac+1 , ..., ad ) be an element of dom(D1 )×· · · dom(Dc−1 )×
dom(Dc+1 ) × · · · × dom(Dd ) and suppose that there are r values ac,i from
dom(Dc ) (for i = 1, ..., r) such that (a1 , ..., ac−1 , ac,i , ac+1 , ..., ad ; mi ) appear
in the cube C, and such that all ac,i roll-up to the same element, call it aru ,
that means ρDc .Bottomk →`c (a) = aru . The roll-up on C will replace these r
cells by one “new” cell which has coordinates (a1 , ..., ac−1 , aru , ac+1 , ..., ad ) in
dom(D1 ) × · · · dom(Dc−1 ) × dom(Dc .`c ) × dom(Dc+1 ) × · · · × dom(Dd ), and
which contains the aggregated measure F ({m1 , ..., mr }). In Star(C), each
one of these “new” cells will be represented by a hyperedge. To achieve this,
the following graphoid OLAP-operation is performed:

Roll-Up(Star(C), #Dc , Dc .(Bottom → `c ); #eµ , µ, F ).

To see the correctness of this claim, the substeps in the above graph-
oid roll-up are analysed. First, Climb(Star(C), #Dc , Dc .(Bottom → `c )) is
performed; a graphoid called G1 is obtained. Compared against Star(C), in
G1 all nodes and edges remain the same, except for the nodes of type #Dc ,
which now contain values at level `c . Next, a minimisation is performed
(to obtain a grouping on Dc ), which may contract some nodes in G1 into
“roll-up” nodes. Call the resulting graphoid G2 . These roll-up nodes of G2
simulate the “new” cells in the cube that store the aggregate information.
Finally, Aggr(G2 , #eµ , µ, F ) contracts edges that have the same adjacency
set and gives them the aggregated value of µ as attribute value.
Drill-Down. As mentioned above, the drill-down to level ` can be seen as
a roll-up from the Bottom level to level `. Therefore, no proof is needed.
Slice. On data cubes, the Slice-operation takes as input a cube C, a dimen-
sion Ds and returns a cube in which the dimension Ds is dropped, and all
measures are aggregated over the dropped dimension. To drop the dimen-
sion Ds , a roll-up to the level All in this dimension is needed first, such that

24
its domain becomes a singleton. Thus, to simulate this on Star(C) using
graphoid OLAP-operations, a climb to the level All in the dimension Ds is
performed, and therefore the proof of the roll-up case holds, taking into ac-
count that all nodes representing Ds will contain the value “all”. Thus, the
slice of the cube C is simulated by Slice(Star(C), Ds ; µ, F ). There one step
missing, however. When slicing a dimension from a cube C, this dimen-
sion is deleted. In the case of the graphoid Star(C), the nodes of type #Ds
are still present in G1 = Slice(Star(C), Ds ; µ, F ). So, n-Delete(G1 , #Ds ) is
needed to delete these nodes.
Dice. Intuitively, the Dice(C, ϕ) operation, where ϕ is a Boolean condition
over level values and measures, selects the cells in a cube C that satisfy
ϕ. The resulting cube has the same dimensionality as the original cube. It
must be shown that Dice(C, ϕ) can be simulated by s-Dice(Star(C), ϕ). As in
Section 4.6, take _^
ϕ= ϕkl ,
k l

with ϕkl of the form D.` < c, D.` = c or D.` > c, where D is a dimension,
` is a level in that dimension and c ∈ dom(D.`); or µ < c, µ = c or µ > c,
where µ is a measure and c belongs to the domain of that measure.
Let (a1 , ..., ad ; c1 , ..., cm ) ∈ dom(D1 ) × · · · × dom(Dd ) → dom(µ1 ) × · · · ×
dom(µm ) be a cell of C that satisfies ϕ. Denote this by (a1 , ..., ad ; c1 , ..., cm ) |=
ϕ. The proof here requires showing that the edges ej , labelled [#µj , cj ] (that
are adjacent to the nodes [#Di , id, ai ], for i = 1, ..., d), for j = 1, ..., m, also
satisfy ϕ. From (a1 , ..., ad ; c1 , ..., cm ) |= ϕ it follows that there exists a k such
that for all l, (a1 , ..., ad ; c1 , ..., cm ) |= ϕkl holds.
If ϕkl is of the form D.` < c, D.` = c or D.` > c, then ϕkl is undefined
in the edge label and thus, it is not false in it. Furthermore, because of the
particular definition of stars in star-graphoids, where all nodes that are ad-
jacent to an edge ej carry information on unique dimensions, ϕkl is not false
in all adjacent nodes that do not contain information on D.` and it is true
in the unique adjacent node that contains information on D.`. Therefore,
the edge ej satisfies ϕkl .
If ϕkl is of the form µ < c, µ = c or µ > c, then ϕkl evaluates to true
on one of the edges ej (that contains information on that measure µ) and is
undefined on the other edges (that contain information on other measures).
On the adjacent nodes to these edges, the condition ϕkl is not false (since
these nodes do not contain information on any measures). In both cases, all
these edges satisfy ϕkl . This means that the strong dice-operation will keep
all these edges.
By a similar reasoning, it can be shown that when (a1 , ..., ad ; c1 , ..., cm ) 6|=
ϕkl , ej 6|= ϕkl holds.

25
This shows that exactly the edges (labelled [#µj , cj ]) corresponding to
cells (a1 , ..., ad ; c1 , ..., cm ), where ϕ is not satisfied are deleted from the graph-
oid Star(C) by the strong dice-operation. This completes the proof.

6 Case Study and Discussion


The running example followed so far in this paper will also be used as a case
study, in order to evaluate the hypergraph model against the traditional
relational OLAP alternative. The example case has many interesting char-
acteristics, such as: (a) Normally it involves huge volumes of data facts (i.e.,
calls); (b) The number of dimensions involved in facts is variable, since calls
may differ from one another in the number of participants; (c) It allows per-
forming not only the typical OLAP operations described in Section 4, over
the fact measures, but also to aggregate the graph elements using graph mea-
sures like shortest paths, centrality, and so on. Therefore, the case study is
appropriate for illustrating and discussing the graphoid model usefulness in
two situations: (a) The classic OLAP scenario, where the relational model
is normally used; and (b) A Graph OLAP scenario, where graph metrics
are aggregated. The hypothesis to be tested here is that, although the re-
lational OLAP alternative works better in scenario (a), when facts have a
fixed dimensionality (e.g., when all calls in the database involve the same
number of participants), the graphoid model is competitive when the num-
ber of dimensions is variable, and definitely better for scenario (b), where
queries compute aggregations over graph metrics.
The dataset to analyse consists of group calls between phone lines, where
a line cannot call itself, and the analyst also needs to identify the line who
started the call. The schemas of the background dimensions are the ones in
Figure 1, with small changes that will be explained below. Facts are similar
to the ones in Figure 3.
Although performing an exhaustive experimental study is beyond the
scope of this paper, and will be part of future work, this section aims at
analysing the plausibility of the graph model to become a better solution
that the relational model for the kinds of problems where factual data are
naturally represented as graphs. For this, the graphoid model are com-
pared against the relational alternative containing exactly the same data.
First, two alternative relational OLAP representations are implemented on
a PostgreSQL database, and three synthetic datasets of different sizes are
produced and loaded into both representations. Then, the same datasets are
loaded into a graph database. Neo4j is used for this purpose, and queries

26
are written in Cypher, Neo4j’s high level query language.7

6.1 Relational Representation


Since the relational design may impact in query performance, two alter-
native designs for the fact table are implemented in order to provide a
fair comparison. In both cases, the fact table schema is the following:
Calls(CallId, CallerId, Participant,StartTime, EndTime, Duration).
The meaning of the attributes is:

• CallId: Call identifier;

• CallerId: The identifier of the line which initiated the call;

• StartTime, EndTime: Initial and final instants of the call;

• Duration: Attribute precomputed as (StartTime - EndTime).

Although the schemas are the same in both cases, the instances differ
from each other. In one case, a call between phone P h1 , P h2 , and P h3 ,
initiated by P h1 , contains the tuples (1 , P h1 , P h2 ) and (1 , P h1 , P h3 ). In
the other case, a tuple (1 , P h1 , P h1 ) is added to the other two to indicate
that P h1 started the call. This makes a difference for queries where the user
is not interested in who did initiate the call. In what follows, both relational
representations are denoted Calls and Calls-alt, respectively.
As expressed above, the background dimensions are the same of Figure 1.
There are two slight differences, however, for practical reasons. First, for
the Time dimension, the bottom level has granularity Timestamp, since the
StartTime and EndTime attributes in the fact tables have that granularity.
That means, a new level is added to the dimension. Second, in the Phone
dimension the bottom level is the phone identifier, denoted Id, which rolls
up to the line number, denoted Number. This is because the caller and the
callee are represented as integers, as usual in real world data warehouses.
The Phone dimension is represented in a single table, keeping the constraints
indicated by the hierarchies. This representation (i.e., Star) was chosen
to provide a fair comparison. In summary, the dimension table schema is
Phone(Id, Number, Customer, City, Country, Operator).

6.2 Graphoid-OLAP Representation


The logical model for the graphoid representing the calls (i.e., the base
graphoid), is similar to the one depicted in Figure 3. There are two main
7
https://neo4j.com/developer/cypher-query-language/

27
Table 1: Dataset sizes for the relational representation
Dataset tuples Calls tuples Calls-alt calls tuples Phone
D1 293,817 420,517 126,700 793
D2 528,408 756,117 227,709 4,689

Table 2: Dataset sizes for the graph representation


Dataset Phone nodes User nodes Call nodes creator edges receiver edges
D1 793 500 126,700 126,700 293,817
D2 4,689 3,000 227,710 227,709 528,408

entity nodes, namely #Phone and #Call, to represent call facts. These are
linked through edges labelled #creator and #receiver, the former going from
the phone that initiated the call, to the node representing such call. Back-
ground dimensions are represented in the same graph, using the entity nodes
#Operator, #User, #City and #Country for the dimension levels. Finally, di-
mension levels are linked using the edges of types #provided by, #has phone,
#belongs to and #lives in. It can be observed that nodes are not duplicated.

6.3 Datasets
For the relational representation, synthetic datasets of two different sizes
are generated and loaded into a PostgreSQL database. Table 1 depicts the
sizes of the datasets. The first column shows the number of tuples in the
Calls fact table. The second column shows the number of tuples in the Calls-
alt fact table. The third column indicates the number of calls (only one
column, since the number of calls is the same in both versions), and the
fourth column tells the number of tuples in the Phone dimension table.
For the graph representation, Table 2 depicts the main numbers of
elements in the Neo4j graph.

6.4 Queries
This section shows how different kinds of complex analytical queries can be
expressed and executed over the three representations described above. Four
kinds of OLAP queries are discussed: (a) Queries where the aggregations are
performed for pairs of objects (e.g., phone lines, persons, etc.); (b) Queries
where aggregations are performed in groups of N objects, where N > 2; (c)
For (a) and (b), rollups to different dimension levels are performed.; (d)

28
Graph OLAP-style aggregations performed over graph metrics. The idea of
these experiments is to study if, when the queries can take advantage of the
graph structure, graphoid-OLAP queries are more concisely expressed, and
more efficiently executed. The impact of N in the relational and the graph
representation is also studied. The queries are described next. For the sake
of space, only some of the SQL and Neo4j queries are shown.

Query 1 Average duration of the calls between groups of N phone lines.

This query computes all the N -subsets of lines that participated in some
call. That means, if a call involves 3 lines, say P h1 , P h2 and P h3 , and N = 2,
the groups will be (P h1 , P h2 ), (P h1 , P h3 ), and (P h2 , P h3 ). Figure ?? shows
the recursive SQL query for the first representation alternative.

Query 2 Average duration of the calls between groups of N users.

Query 3 Average duration of the calls between groups of N operators.

This analyses a roll-up to the level Operator, which has less instance mem-
bers than the level User addressed in Query 2.

Query 4 For each pair of Phones in the Calls graph, compute the shortest
path between them.

This query aims at analysing the connections between phone line users,
and has many real-world applications (for example, to investigate calls made
between two persons who use a third one as an intermediary). From a
technical point of view, this is an aggregation over the whole graph, using
as a metric the shortest path between every pair of nodes.
Finally, the following queries combine the computation of graph metrics
together with roll-up and dice operations.

Query 5 Compute the shortest path between pairs (p1 , p2 ) of phone lines,
such that p1 corresponds to operator “Claro” and p2 corresponds to operator
“Movistar”.

Query 6 Compute the shortest path between pairs (p1 , p2 ) of phone lines,
such that p1 corresponds to a user from the city of Buenos Aires and p2
corresponds to a user from the city of Salta.

Query 7 Compute the shortest path between pairs (p1 , p2 ) of phone lines,
such that p1 corresponds to a user from the city of Buenos Aires.

29
6.5 Results
Table 3 shows the results of the experiments. The tests were ran on machine
with a i7-6700 processor and 12 GB of RAM, and 250GB disk (actually, a
virtual node in a cluster). The execution times are depicted, and are the
averages of five runs of each experiment, expressed in seconds. The winning
alternatives are marked in boldface, for clarity.

Table 3: Experimental results (running times in seconds).

Dataset Calls Calls Calls Calls-alt Calls-alt Calls-alt Neo4j Neo4j Neo4j
N =2 N =3 N =4 N =2 N =3 N =4 N =2 N =3 N =4
D1-Q1 4.9 7.6 9.5 5.4 8.7 10.6 7.3 11.2 12.5
D1-Q2 4.6 11.7 12.9 4.4 12.3 14.5 7 11.7 14.8
D1-Q3 6.6 7.3 11.5 12.8 12.6 14.7 3.7 10.8 15.5
D1-Q4 ∞ N/A N/A ∞ N/A N/A 185 N/A N/A
D1-Q5 ∞ N/A N/A ∞ N/A N/A 21 N/A N/A
D1-Q6 ∞ N/A N/A ∞ N/A N/A 6 N/A N/A
D1-Q7 ∞ N/A N/A ∞ N/A N/A 34 N/A N/A
D2-Q1 9.3 14.1 15.1 10.4 16.2 17.7 15.6 17.5 21.6
D2-Q2 12.9 19 20.7 14.5 24 26.8 20.2 21.6 24.8
D2-Q3 12.5 19.4 22.2 14.3 14.6 22.8 9.3 18.7 28.4
D2-Q4 ∞ N/A N/A ∞ N/A N/A ∞ N/A N/A
D2-Q5 ∞ N/A N/A ∞ N/A N/A 677 N/A N/A
D2-Q6 ∞ N/A N/A ∞ N/A N/A 123 N/A N/A
D2-Q7 ∞ N/A N/A ∞ N/A N/A 924 N/A N/A

6.6 Discussion of Results


In Table 3 it can be seen that running traditional OLAP queries, like Query
1, Query 2 and Query 3, takes approximately the same time in the relational
and graphoid models, with a slight advantage for the former. Further, it can
be seen that for Queries 2 and 3, which include a roll-up, results are very
similar, and even Neo4j wins here in some cases. In Query 1, which is an
aggregation over the fact graph, the relational alternatives work better.8
However, for typical Graph OLAP queries (Queries 4 through 7), which ag-
gregate graph metrics, the graph model shows a dramatical advantage over
the relational alternative. For Neo4j, Query 4 does not finish within a rea-
8
It is worth noting that Neo4j (and graph databases in general) is a novel database,
whose query optimization strategy is still very basic. On the contrary, relational databases
are mature technologies, and query optimization is very efficient indeed. Further, for the
experiments presented here, the PostgreSQL databases have been tuned to perform in the
best possible way. In this sense, Neo4j’s performance for typical OLAP queries is, in some
sense, penalized.

30
sonable time for the largest of the two datasets (D2) but performance is
acceptable for D1. On the other hand, the relational alternatives do not
terminate successfully neither for D1 nor for D2. It is important to make
it clear that with an ad-hoc relational design, specifically for graph repre-
sentation, it is possible that the performance of the relational alternative
for shortest path aggregations could be improved, although it will hardly be
close to the graph alternative, given the results presented here. However,
the intention of this paper is to present a flexible model that can perform
efficiently on a variety of situations. In this sense, the tests presented here
suggest that the graphoid data model can be competitive with the relational
model for classic OLAP queries, but is much better for typical Graph OLAP
ones.

7 Conclusion and Open Problems


This paper presented a data model for graph analysis based on node- and
edge-labelled directed multi-hypergraphs, called graphoids. A collection of
OLAP operations, analogous to the ones that apply to data cubes, was
formally defined over graphoids. It was also formally proved that the classic
data cube model is a particular case of the graphoid data model. As far as
the authors are aware of, this is the first proposal that formally addresses the
problem of defining OLAP operations over hypergraphs. Supported by this
proof, it was shown that the graphoid model can be competitive with the
relational implementation of OLAP, but clearly much better when graph
operations are used to aggregate graphs. This feature allows devising a
general OLAP framework that may cope with the flexible needs of modern
data analysis, where data may arrive in different forms. It is worth to
remark, once more, that the experiments presented do not pretend to be
exhaustive, but a good general indication of the plausibility of the approach,
and it is clear that the graph data model provides OLAP with a machinery
of more powerful tools than the classic cube data model, which is already
good news for the OLAP practitioners.
Building on the results in this paper, future work includes looking for
further graph metrics that can be applied to the graphoid model, new case
studies, and the study of query optimization strategies. Moreover, the ap-
proach can also benefit from tools supporting parallel computation with
columnar databases as backends. This can further improve the relational
OLAP computation, while keeping the properties of the graphoid model for
Graph OLAP queries.

31
Acknowledgments
Alejandro Vaisman was supported by a travel grant from Hasselt University
(Korte verblijven–inkomende mobiliteit, BOF16KV09). He was also par-
tially supported by PICT-2014 Project 0787 and PICT-2017 Project 1054.
The authors also thank T. Colloca, S. Ocamica, J. Perez Bodean, and N.
Castaño, for their collaboration in the data preparation for the experiments.

References
[1] R. Angles. A Comparison of Current Graph Database Models. In
Proceedings of ICDE Workshops, pages 171–177, Arlington, VA, USA,
2012.

[2] R. Angles, M. Arenas, P. Barceló, A. Hogan, J. L. Reutter, and D. Vr-


goc. Foundations of modern query languages for graph databases. ACM
Comput. Surv., 50(5):68:1–68:40, 2017.

[3] C. Chen, X. Yan, F. Zhu, J. Han, and P. Yu. Graph OLAP: a multi-
dimensional framework for graph data analysis. Knowl. Inf. Syst.,
21(1):41–63, 2009.

[4] J. Cohen, B. Dolan, M. Dunlap, J.M. Hellerstein, and C. Welton. MAD


Skills: New analysis practices for big data. Proceedings of the VLDB
Endowment, 2(2):1481–1492, 2009.

[5] Alfredo Cuzzocrea, Ladjel Bellatreche, and Il-Yeol Song. Data Ware-
housing and OLAP over Big Data: Current Challenges and Future Re-
search Directions. In Proceedings of DOLAP, pages 67–70, New York,
NY, USA, 2013. ACM.

[6] Leticia I. Gómez, Bart Kuijpers, and Alejandro A. Vaisman. Performing


OLAP over graph data: Query language, implementation, and a case
study. In Proceedings of BIRTE, Munich, Germany, August 28, 2017,
pages 6:1–6:8, 2017.

[7] Leticia I. Gómez, Bart Kuijpers, and Alejandro A. Vaisman. Analytical


queries on semantic trajectories using graph databases. TGIS Trans.
Geog. Inf. Syst., 23(5), 2019.

[8] O. Hartig. Reconciliation of RDF* and property graphs. CoRR,


abs/1409.3288, 2014.

[9] Ralph Kimball. The Data Warehouse Toolkit. J. Wiley and Sons, 1996.

32
[10] M. B. Kraiem, J. Feki, K. Khrouf, F. Ravat, and O. Teste. Modeling
and OLAPing social media: the case of twitter. Social Netw. Analys.
Mining, 5(1):47:1–47:15, 2015.

[11] Bart Kuijpers and Alejandro A. Vaisman. An algebra for OLAP. In-
telligent Data Analysis, 21(5), 2017.

[12] N. U. Rehman, A. Weiler, and M. H. Scholl. OLAPing social media: the


case of twitter. In Advances in Social Networks Analysis and Mining
2013, ASONAM ’13, pages 1139–1146, Niagara, ON, Canada, 2013.

[13] I. Robinson, J. Webber, and Emil Eifrém. Graph Databases. O’Reilly


Media, 2013.

[14] Bo Tang, Shi Han, Man Lung Yiu, Rui Ding, and Dongmei Zhang.
Extracting top-k insights from multi-dimensional data. In Proceedings
of ACM SIGMOD, Chicago, IL, USA, May 14-19, 2017, pages 1509–
1524, 2017.

[15] A. A. Vaisman and E. Zimányi. Data Warehouse Systems: Design and


Implementation. Springer, 2014.

[16] Alejandro Vaisman, Florencia Besteiro, and Maximiliano Valverde.


modelling and querying star and snowflake warehouses using graph
databases. In Proceedings of ADBIS Conference 2019, Bled, Slovenia,
Sept. 8-11, 2019, 2017.

[17] Z. Wang, Q. Fan, H. Wang, K-L. Tan, D. Agrawal, and A. El Abbadi.


Pagrol: Parallel graph OLAP over large-scale attributed graphs. In
Proceeding of IEEE ICDE, pages 496–507, 2014.

[18] Peixiang Zhao, Xiaolei Li, Dong Xin, and Jiawei Han. Graph Cube: on
warehousing and OLAP multidimensional networks. In Proceedings of
ACM SIGMOD, pages 853–864. ACM, 2011.

33

View publication stats

You might also like