Cloud-Based RDF Data Management
Synthesis Lectures on Data Management
Editor
H.V. Jagadish, University of Michigan
Founding Editor
M. Tamer Özsu, University of Waterloo
Synthesis Lectures on Data Management is edited by H.V. Jagadish of the University of Michigan.
The series publishes 80–150 page publications on topics pertaining to data management. Topics
include query languages, database system architectures, transaction management, data
warehousing, XML and databases, data stream systems, wide scale data distribution, multimedia
data management, data mining, and related subjects.
Data Profiling
Ziawasch Abedjan, Lukasz Golab, Felix Naumann, and Thorsten Papenbrock
2018
Querying Graphs
Angela Bonifati, George Fletcher, Hannes Voigt, and Nikolay Yakovets
2018
On Uncertain Graphs
Arijit Khan, Yuan Ye, and Lei Chen
2018
Instant Recovery with Write-Ahead Logging: Page Repair, System Restart, and Media
Restore
Goetz Graefe, Wey Guy, and Caetano Sauer
2014
Semantics Empowered Web 3.0: Managing Enterprise, Social, Sensor, and Cloud-based
Data and Services for Advanced Applications
Amit Sheth and Krishnaprasad Thirunarayan
2012
Probabilistic Databases
Dan Suciu, Dan Olteanu, Christopher Ré, and Christoph Koch
2011
Database Replication
Bettina Kemme, Ricardo Jimenez-Peris, and Marta Patino-Martinez
2010
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in
any form or by any means—electronic, mechanical, photocopy, recording, or any other except for brief quotations
in printed reviews, without the prior permission of the publisher.
DOI 10.2200/S00986ED1V01Y202001DTM062
Lecture #62
Series Editor: H.V. Jagadish, University of Michigan
Founding Editor: M. Tamer Özsu, University of Waterloo
Series ISSN
Print 2153-5418 Electronic 2153-5426
Cloud-Based RDF Data Management
Zoi Kaoudi
Technische Universität Berlin
Ioana Manolescu
INRIA
Stamatis Zampetakis
TIBCO Orchestra Networks
Morgan & Claypool Publishers
ABSTRACT
Resource Description Framework (or RDF, in short) is set to deliver many of the original semi-
structured data promises: flexible structure, optional schema, and rich, flexible Universal Re-
source Identifiers as a basis for information sharing. Moreover, RDF is uniquely positioned to
benefit from the efforts of scientific communities studying databases, knowledge representation,
and Web technologies. As a consequence, the RDF data model is used in a variety of applica-
tions today for integrating knowledge and information: in open Web or government data via the
Linked Open Data initiative, in scientific domains such as bioinformatics, and more recently in
search engines and personal assistants of enterprises in the form of knowledge graphs.
Managing such large volumes of RDF data is challenging due to the sheer size, hetero-
geneity, and complexity brought by RDF reasoning. To tackle the size challenge, distributed ar-
chitectures are required. Cloud computing is an emerging paradigm massively adopted in many
applications requiring distributed architectures for the scalability, fault tolerance, and elasticity
features it provides. At the same time, interest in massively parallel processing has been renewed
by the MapReduce model and many follow-up works, which aim at simplifying the deployment
of massively parallel data management tasks in a cloud environment.
In this book, we study the state-of-the-art RDF data management in cloud environments
and parallel/distributed architectures that were not necessarily intended for the cloud, but can
easily be deployed therein. After providing a comprehensive background on RDF and cloud
technologies, we explore four aspects that are vital in an RDF data management system: data
storage, query processing, query optimization, and reasoning. We conclude the book with a
discussion on open problems and future directions.
KEYWORDS
RDF, cloud computing, MapReduce, key-value stores, query optimization, reason-
ing
Contents
1 Introduction
2 Preliminaries
2.1 Resource Description Framework (RDF)
2.1.1 Data Model
2.1.2 The SPARQL Query Language
2.2 Distributed Storage and Computing Paradigms
2.2.1 Distributed File Systems
2.2.2 Distributed Key-Value Stores
2.2.3 Distributed Computation Frameworks: MapReduce and Beyond
2.3 Summary
7 Concluding Remarks
Bibliography
Authors’ Biographies
CHAPTER 1
Introduction
The Resource Description Framework (RDF) [W3C, 2004] first appeared in 2004 to realize
the vision of the Semantic Web [Berners-Lee et al., 2001]. The goal of the Semantic Web was to
evolve the Web in order to accommodate intelligent, automatic processes that perform tasks on
behalf of users by utilizing machine-readable data. This data should have well-defined semantics,
enabling better data integration and interoperability. RDF provided the standardized means of
representing this structured and meaningful information on the Web.
Today a vast amount of data is available online, covering many aspects of human
activities, knowledge, and experiences. RDF provides a simple and abstract knowledge repre-
sentation for such data on the Web, where resources are uniquely identified by Universal Resource
Identifiers (URIs). RDF Schema (RDFS) [W3C, 2014b] is the vocabulary language of RDF. It gives
meaning to resources, groups them into concepts and identifies the relationships between these
concepts. Web Ontology Language (OWL) [W3C OWL Working Group, 2012] can also be
used for conceptualization and provides further expressiveness in stating relationships among
the resources. Ontology languages, such as RDFS and OWL, allow for deriving entailed infor-
mation through reasoning. For instance, one can reason that any student is also a human, or that
if X worksWith Y, then X also knows Y; or that if X drives car Z, then X is a human and Z
is a vehicle. Finally, to be able to explore and query structured information expressed in RDF,
SPARQL [W3C, 2013] has been the official W3C recommendation language since 2008.
RDF is used today in a variety of applications. A particularly interesting one comes from
the Open Data concept that “certain data should be freely available to everyone to use and republish
as they wish, without restrictions from copyright, patents or other mechanisms of control.”1 Open Data
federates players of many roles, from organizations such as businesses and governments aiming to
demonstrate transparency and good (corporate) governance, to end users interested in consum-
ing and producing data to share with others, to aggregators that may build business models
around warehousing, curating, and sharing this data [Raschia et al., 2012]. Sample governmen-
tal Open Data portals are the ones from the U.S.,2 UK,3 and France.4 At the same time, if
Open Data designates a general philosophy, Linked Data refers to the “recommended best practice
for exposing, sharing, and connecting pieces of data, information, and knowledge on the Semantic Web
using URIs and RDF” [Berners-Lee, 2006]. In practice, Open and Linked data are frequently
1 http://en.wikipedia.org/wiki/Open_data
2 www.data.gov
3 www.data.gov.uk
4 www.etalab.fr
combined to facilitate data sharing, interpretation, and exploitation [LOD]. Sample applications
of Linked Open Data are DBPedia (the Linked Data version of Wikipedia) and BBC's platforms
for the World Cup 2010 and the 2012 Olympic Games [Kiryakov et al., 2010].
In addition, RDF is the main data model behind private or public knowledge graphs
(e.g., DBPedia [Auer et al., 2007], YAGO [Suchanek et al., 2007], Wikidata5 (previously Free-
base)). Knowledge graphs have been increasingly used recently by enterprises to facilitate and
enhance the functionality of their products. Knowledge graphs are simply graphs that connect
entities via their relationships. For example, Google uses a knowledge graph in its search engine
and personal assistant, Microsoft built a knowledge graph [Gao et al., 2018] used in its products
(e.g., Bing and Cortana), and Walmart [Deshpande et al., 2013] and Amazon [Dong, 2018] use
knowledge graphs in a variety of applications such as product search and advertising.
To exploit large volumes of RDF data, one could try to build a centralized ware-
house. Some of the very first systems that appeared in the Semantic Web community include
Jena [Wilkinson et al., 2003] and Sesame [Broekstra and Kampman, 2002]. Later on, RDF-
based stores gained interest in the database community as well, as illustrated by the works
of Abadi et al. [2009], Neumann and Weikum [2010b], and Weiss et al. [2008b]. Moreover,
commercial database management systems also started providing support for RDF, such as
Oracle 11g [Chong et al., 2005] or IBM DB2 10.1 [Bornea et al., 2013]. These works mostly
focused on RDF viewed as a relational database on which to evaluate conjunctive queries and
do not consider RDF-specific features such as those related to reasoning. A different line of
research focused on viewing RDF as a graph and exploited graph models for the indexing and
storage and subgraph matching for querying [Udrea et al., 2007, Zou et al., 2014].
Large and increasing data volumes have also raised the need for distributed storage archi-
tectures. Past works on distributed RDF query processing and reasoning have relied on peer-
to-peer platforms [Kaoudi and Koubarakis, 2013, Kaoudi et al., 2010] or clustered architec-
tures [Erling and Mikhailov, 2009, Harris et al., 2009, Owens et al., 2008]. Most of these
approaches have proved inadequate to scale to the large amounts of RDF data that we
encounter nowadays. Peer-to-peer architectures suffer from long latency during query eval-
uation because of the many communication steps required when exchanging large amounts of
data and because of the geo-distribution of peers. Clustered architectures, on the other hand, require very
fine-grained tuning of the cluster and have long loading times.
Cloud computing is an emerging paradigm massively adopted in many applications for
the scalability, fault tolerance and elasticity features it offers, which also allows for effortless
deployment of distributed and parallel architectures. At the same time, interest in massively
parallel processing has been renewed by the MapReduce model [Dean and Ghemawat, 2004]
and many follow-up works, which aim at simplifying the deployment of massively parallel data
management tasks in a cloud environment. For these reasons, cloud-based stores are an inter-
esting avenue to explore for handling very large volumes of RDF data.
5 https://www.wikidata.org
The main goal of this book is to study state-of-the-art RDF data management in a cloud
environment. It also investigates the most recent advances of RDF data management in par-
allel/distributed architectures that were not necessarily intended for the cloud, but can easily be
deployed therein. We provide a description of existing systems and proposals which can handle
large volumes of RDF data while classifying them along different dimensions and highlight-
ing their limitations and opportunities. We start by identifying four dimensions according to
the way in which systems implement four fundamental functionalities: data storage, query pro-
cessing, query optimization, and reasoning. Then, within each dimension we classify each system
according to its basic characteristics.
The remainder of this book is organized as follows. We start with Chapter 2 by introduc-
ing the main features of RDF and its accompanying schema language RDFS. The same chapter
gives an overview of the cloud-based frameworks and tools used to date for RDF data
management. In Chapter 3 we present current approaches on RDF data storage and in Chap-
ter 4 we describe different query processing paradigms for evaluating RDF queries. Chapter 5
describes the state-of-the-art in query optimization used for cloud-based query evaluation. In
Chapter 6 we lay out the state-of-the-art in RDFS reasoning on top of cloud platforms. Finally,
we conclude in Chapter 7 and give insights into open problems and directions.
CHAPTER 2
Preliminaries
This chapter introduces the main concepts of RDF and its accompanying schema language
RDFS. It additionally describes the main characteristics of the distributed paradigms and frame-
works used in the cloud that have been used in building RDF data management systems.
Definition 2.1 RDF Triple. Let U be a set of URIs, L be a set of literals, and B be a set of
blank nodes. A well-formed RDF triple is a tuple (s, p, o) from (U ∪ B) × U × (U ∪ L ∪ B).
The syntactic conventions for representing valid URIs, literals, and blank nodes can be
found in RDF Concepts. In this book, literals are shown as strings enclosed by quotation marks,
while URIs are shown as simple strings (see also discussion on namespaces below).
RDF admits a natural graph representation, with each (s, p, o) triple seen as a p-labeled
directed edge from the node identified by s to the node identified by o.
We use val(G) to refer to the values (URIs, literals, and blank nodes) of an RDF graph
G.
:sculptor :subClassOf :artist .
:painter :subClassOf :artist .
:cubist :subClassOf :painter.
:paints :subPropertyOf :creates.
:creates :domain :artist.
:creates :range :artifact.
:picasso :type :cubist .
:picasso :name "Pablo" .
:picasso :paints :guernica .
:guernica :type :artifact .
:guernica :exhibited :reinasofia .
:reinasofia :located :madrid .
:rodin :type :sculptor .
:rodin :name "Auguste" .
:rodin :creates :thethinker .
:thethinker :exhibited :museerodin .
:museerodin :located :paris .
For instance, Figure 2.1 depicts an RDF graph in the so-called N-Triples syntax, while
Figure 2.2 shows a graphical representation of the same graph.
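As a concrete illustration, the short sketch below loads part of this graph with the rdflib Python library; the http://example.org/ namespace chosen for the ":" prefix is purely hypothetical, and Turtle syntax is used so that the prefixed names of Figure 2.1 can be written as-is.

# A minimal sketch: loading (part of) the graph of Figure 2.1 with the rdflib
# Python library. The http://example.org/ namespace is a hypothetical choice for
# the ":" prefix; Turtle syntax is used so that prefixed names can be written as-is.
from rdflib import Graph

DATA = """
@prefix : <http://example.org/> .
:picasso :type :cubist .
:picasso :name "Pablo" .
:picasso :paints :guernica .
:rodin :creates :thethinker .
:thethinker :exhibited :museerodin .
:museerodin :located :paris .
"""

g = Graph()
g.parse(data=DATA, format="turtle")
for s, p, o in g:          # every parsed statement is an (s, p, o) triple (Definition 2.1)
    print(s, p, o)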
In some cases, we need to work with several RDF graphs while still being able to distin-
guish the graph each triple originates from. We achieve this by considering named RDF graphs
where each graph is associated with a name that can be a URI or a blank node. The notion of
an RDF triple is extended as follows to capture these needs.
Definition 2.3 RDF quad. Let U be a set of URIs, L be a set of literals, and B be a set of
blank nodes. A well-formed RDF quad is a tuple (s, p, o, g) from (U ∪ B) × U × (U ∪ L ∪ B) ×
(U ∪ B).
We are now able to capture multiple RDF graphs using the notion of an RDF dataset.
Definition 2.4 An RDF dataset is a set of RDF quads.
An RDF dataset may contain only a single graph, in which case all the quads of the form
(s, p, o, g) have the same value for g. In such cases, we may use the terms RDF graph and RDF
dataset interchangeably.
Namespaces are supported in RDF as a means to support flexible choices of URIs as well
as interoperability between different datasets. A namespace typically serves to identify a certain
application domain. Concretely, a namespace is identified by a URI, which is used as a pre-
fix of all URIs defined within the respective application domain. Thus, for instance, the URI
http://www.w3.org/1999/02/22-rdf-syntax-ns is chosen by the W3C to represent the
domain of a small set of predefined URIs which are part of the RDF specification itself; or, for
instance, http://swat.cse.lehigh.edu/onto/univ-bench.owl is used by the University of
Lehigh to identify its domain of representation. To denote that the URI of a resource r is part
of the application domain identified by a namespace URI u, the URI of u is a prefix of the
URI of r . The suffix of r ’s URI is typically called local name; it uniquely identifies r among all
the resources of the namespace u. This enables other application domains to use the same local
name in conjunction with their respective namespace URIs without causing confusions between
the two. On the other hand, when one wishes to refer in a dataset to a specific resource from a
specific namespace, the full URI (including the namespace URI prefix) must be used.
While the above mechanism is flexible, it leads to rather lengthy URIs, which increase
the space occupancy of a dataset. To solve this problem, within an RDF graph, a local namespace
prefix is associated with a namespace URI and serves as a shorthand to represent the latter. Thus,
URIs are typically of the form nsp :ln, where nsp stands for the local namespace prefix while ln
represents the local name.
Resource descriptions can be enhanced by specifying to which class(es) a given resource
belongs by means of the pre-defined rdf:type property which is part of the RDF specification.
For instance, the RDF Graph in Figure 2.1 features the classes :artist, :painter, :cubist,
etc., and the resource :picasso is stated to be of type :cubist.
Further, the RDF Schema [W3C, 2014b] specification allows relating classes and prop-
erties used in a graph through ontological (i.e., deductive) constraints expressed as triples using
built-in properties:
Table 2.1: Deductive constraints expressible in an RDF Schema
follows that :rodin is of type :artist. Finally, rule i4 allows us to infer that a resource o is of type
c if o is a value of a property p whose range is c . For example, given that :thethinker is created
by someone, and knowing that the range of :creates is :artifact, we can infer that :thethinker
is an :artifact.
Note that in this example, there were two independent ways of inferring that :rodin
rdf:type :artist, one based on rule i3 as explained above, and another one based on the rule
i1 , knowing that :rodin is a :sculptor and that every sculptor is an artist. More generally, al-
though in our example this is not the case, the RDF graph may even have contained the explicit
fact :rodin rdf:type :artist. Thus, in an arbitrary RDF graph, the same fact may be present
only explicitly, or only implicitly, and there may be several ways to infer the same fact from those
explicitly present in the graph.
Observe that unlike the traditional setting of relational databases, RDF Schema con-
straints are expressed with RDF triples themselves and are part of the RDF graph (as opposed
to relational schemas being separated from the relational database instances). Within an RDF
dataset, the term fact is commonly used to denote a triple whose property is not one of the
predefined RDF Schema properties.
Definition 2.5 RDFS Closure. The RDFS closure of an RDF graph G, denoted G∞, is
obtained by adding to G all the implicit triples that derive from consecutive applications of the
entailment rules on G until a fixpoint is reached.
It has been shown that under RDF Schema constraints, the closure of an RDF graph
is finite and unique (up to blank node renaming) [Muñoz et al., 2009, ter Horst, 2005b]. For
instance, Figure 2.3 depicts the RDFS closure of the RDF graph shown in Figure 2.1. In Fig-
ure 2.3, the inferred triples that were not already present in the initial RDF graph are shown in
italic font.
:sculptor :subClassOf :artist .
:painter :subClassOf :artist .
:cubist :subClassOf :painter.
:cubist :subClassOf :artist .
:paints :subPropertyOf :creates.
:creates :domain :artist.
:creates :range :artifact.
:picasso :type :cubist .
:picasso :type :painter .
:picasso :type :artist .
:picasso :name "Pablo" .
:picasso :paints :guernica .
:picasso :creates :guernica .
:guernica :type :artifact .
:guernica :exhibited :reinasofia .
:reinasofia :located :madrid .
:rodin :type :sculptor .
:rodin :type :artist .
:rodin :name "Auguste" .
:rodin :creates :thethinker .
:thethinker :type :artifact .
:thethinker :exhibited :museerodin .
:museerodin :located :paris .
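The closure can be computed by a simple fixpoint procedure, as in the sketch below. Since Table 2.1 is not reproduced here, the sketch assumes the usual RDFS rules discussed in the text (instance propagation along subClassOf, propagation along subPropertyOf, and domain/range typing, with subClassOf transitivity folded into the same loop); it is only an illustration of Definition 2.5, not the implementation of any particular system.

# A sketch of RDFS closure computation (Definition 2.5). Triples are plain
# ('s', 'p', 'o') tuples; the rules applied below follow their usage in the text
# (i1: subClassOf on instances, i3: domain, i4: range), plus subPropertyOf
# propagation and subClassOf transitivity, all run until a fixpoint is reached.
SC, SP, DOM, RNG, TYPE = ":subClassOf", ":subPropertyOf", ":domain", ":range", ":type"

def rdfs_closure(triples):
    g = set(triples)
    while True:
        sc = {(s, o) for s, p, o in g if p == SC}
        sp = {(s, o) for s, p, o in g if p == SP}
        dom = {(s, o) for s, p, o in g if p == DOM}
        rng = {(s, o) for s, p, o in g if p == RNG}
        new = set()
        for s, p, o in g:
            if p == TYPE:
                new |= {(s, TYPE, c2) for c1, c2 in sc if c1 == o}    # subclass (i1)
            new |= {(s, p2, o) for p1, p2 in sp if p1 == p}            # subproperty
            new |= {(s, TYPE, c) for p1, c in dom if p1 == p}          # domain (i3)
            new |= {(o, TYPE, c) for p1, c in rng if p1 == p}          # range (i4)
        new |= {(a, SC, c) for a, b in sc for b2, c in sc if b == b2}  # subClassOf transitivity
        if new <= g:                                                    # fixpoint reached
            return g
        g |= new

On the triples of Figure 2.1, this yields the additional triples shown in italics in Figure 2.3, e.g., (:picasso, :creates, :guernica) and (:thethinker, :type, :artifact).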
Definition 2.6 Triple Pattern. Let U be a set of URIs, L be a set of literals and V be a set of
variables, a triple pattern is a tuple (s, p, o) from (U ∪ V) × (U ∪ V) × (U ∪ L ∪ V).
SELECT ?a ?c
WHERE { ?c :creates ?a . ?a :exhibited ?m . ?m :located :paris . }
Triple patterns are used to specify queries against a single RDF graph. Going for-
ward, when no confusion arises, we refer to BGP patterns as triple patterns or even simply
atoms/triples.
Based on triple (or quad) patterns, one can express SPARQL BGP queries as below.
Definition 2.7 BGP Query. A BGP query is an expression of the form
SELECT ?x1, …, ?xm WHERE { t1, …, tn }
where t1, …, tn are triple patterns and ?x1, …, ?xm are distinguished variables appearing in
t1, …, tn. We also define the size of the query as its number of distinct triple patterns.
Alternatively, for ease of presentation, BGP queries can be represented using the equiva-
lent conjunctive query notation, e.g., the query appearing in Definition 2.7 could be denoted as
q(x1, …, xm) ← t1, …, tn.
In the remainder we use the terms RDF query, SPARQL query, and BGP query inter-
changeably, referring to the SPARQL fragment described by Definition 2.7. Furthermore, we
use var(q) (resp. var(ti)) to refer to the variables of a query q (resp. an atom ti). Additionally, we
refer to the set of head variables of q as headvar(q).
An example BGP query, asking for the resources exhibited in Paris and their creators, is
shown in Figure 2.4.
Definition 2.8 Valid Assignment. Let q be a BGP query and G be an RDF graph. A mapping
μ: var(q) → val(G) is a valid assignment iff ∀ti ∈ q, μ(ti) ∈ G, where we denote by μ(ti) the result
of replacing every occurrence of a variable e ∈ var(q) in the triple pattern ti by the value
μ(e) ∈ val(G).
Definition 2.9 Result Tuple. Let q be a BGP query, G be an RDF graph, μ be a valid as-
signment, and x̄ = headvar(q). The result tuple of q based on μ, denoted res(q, μ), is the tuple:
res(q, μ) = { μ(x1), …, μ(xm) | x1, …, xm ∈ x̄ }
Definition 2.10 Query Evaluation. Let q be a BGP query and G be an RDF graph. The
evaluation of q against G is:
q(G) = { res(q, μ) | μ: var(q) → val(G) is a valid assignment }
where res(q, μ) is the result tuple of q based on μ.
The evaluation of query QA (shown in Figure 2.4) against the graph of Figure 2.1 is the
tuple :rodin :thethinker.
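The sketch below spells out this evaluation process: it enumerates the valid assignments of Definition 2.8 over the explicit triples of a graph and projects them on the head variables. For simplicity it returns a set of result tuples, whereas SPARQL evaluation has bag semantics.

# A sketch of BGP query evaluation (Definitions 2.8-2.10). Triple patterns are
# (s, p, o) tuples whose variables start with '?'; the graph is a set of constant
# triples. For simplicity, results are returned as a set rather than a bag.
def is_var(term):
    return isinstance(term, str) and term.startswith("?")

def bind(mu, term, value):
    if not is_var(term):
        return term == value            # a constant must match the triple exactly
    if term in mu:
        return mu[term] == value        # an already-bound variable must agree
    mu[term] = value
    return True

def evaluate(bgp, head_vars, graph):
    assignments = [{}]
    for pattern in bgp:                 # extend partial assignments pattern by pattern
        extended = []
        for mu in assignments:
            for triple in graph:
                mu2 = dict(mu)
                if all(bind(mu2, t, v) for t, v in zip(pattern, triple)):
                    extended.append(mu2)
        assignments = extended
    return {tuple(mu[v] for v in head_vars) for mu in assignments}

Evaluating query QA of Figure 2.4 this way against the triples of Figure 2.1 produces a single valid assignment, binding ?c to :rodin and ?a to :thethinker.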
Query evaluation only accounts for triples explicitly present in the graph. If entailed triples
are not explicitly in the graph, evaluating the query may miss some results which would have
been obtained otherwise. For instance, consider the query:
SELECT ?a
WHERE { ?a rdf:type :artifact . }
Evaluating the query on the graph in Figure 2.1 returns :guernica but not :thethinker
because the latter is not explicitly of type :artifact. Instead, query answering on the RDFS
closure of the graph depicted in Figure 2.3 leads to two answers :guernica and :thethinker.
The following definition allows capturing results due to both the explicit and implicit
triples in an RDF graph:
Definition 2.11 Query Answering. The answer of a BGP query q over an RDF graph G is the
evaluation of q over G∞.
It is worth noting that while the relational SPJ queries are most often used with set se-
mantics, SPARQL, just like SQL, has bag (multiset) semantics.
An alternative strategy to RDFS closure for enabling complete query answering w.r.t. im-
plicit information is known as query reformulation. Query reformulation expands the original
query into a reformulated one whose answer over the initial graph is complete.
Definition 2.12 Query Reformulation. Given a query q and an RDF graph G, a query q^ref
is a reformulation of q w.r.t. the RDFS constraints of G iff q(G∞) = q^ref(G).
In the UCQ, all expansions of the original term ?a rdf:type :artifact are evaluated, and
a union of their results is returned. For this very simple one-atom query, the UCQ, SCQ and
JUCQ reformulations coincide.
It is easy to see that for more complex queries, each atom may have a large expansion,
and the possible combinations among these expansions lead to a potentially large reformulation.
The interest of SCQs and JUCQs, which are clearly equivalent to the UCQ reformulation, is
to propose different orders among the query operators, which oftentimes lead to more efficient
evaluation through a standard RDBMS.
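As an illustration of the reformulation idea, the sketch below expands a single class atom into its UCQ expansions using the RDFS constraints of the running example. The fresh variable name and the omission of the subPropertyOf rewriting of property atoms are simplifications made for brevity, so this is a sketch, not a complete reformulation algorithm.

# A sketch of UCQ-style expansion of one class atom (?x rdf:type C): the atom is
# rewritten using the subClassOf, domain, and range constraints of the graph, and
# the union of the expansions is then evaluated over the explicit triples only.
# The fresh variable "?_f" and the omission of subPropertyOf rewriting are
# simplifications for illustration.
def expand_type_atom(var, cls, schema):
    expansions = [(var, ":type", cls)]
    for s, p, o in schema:
        if p == ":subClassOf" and o == cls:
            expansions += expand_type_atom(var, s, schema)   # instances of any subclass
        elif p == ":domain" and o == cls:
            expansions.append((var, s, "?_f"))               # subjects of s are of type cls
        elif p == ":range" and o == cls:
            expansions.append(("?_f", s, var))               # objects of s are of type cls
    return expansions

schema = [(":sculptor", ":subClassOf", ":artist"), (":painter", ":subClassOf", ":artist"),
          (":cubist", ":subClassOf", ":painter"), (":creates", ":domain", ":artist"),
          (":creates", ":range", ":artifact")]

# (?a, :type, :artifact) expands to itself plus (?_f, :creates, ?a); evaluating the
# union over the graph of Figure 2.1 returns both :guernica and :thethinker.
print(expand_type_atom("?a", ":artifact", schema))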
Figure 2.5: The Bigtable data model: a database (DB) contains (1:N) tables, a table contains (1:N) rows, a row contains (1:N) columns, and a column contains (1:N) cells.
Distributed file systems, however, do not provide fine-grained data access, and thus, se-
lective access to a piece of data can only be achieved at the granularity of a file. There have
been works like [Dittrich et al., 2010, 2012] which extend Hadoop and improve its data ac-
cess efficiency with indexing functionality, but the proposed techniques are yet to be adopted by
Hadoop’s development community.
Accumulo is very similar to HBase since it also follows the Bigtable design pattern and is
implemented on top of HDFS. In contrast with HBase and Bigtable, it also provides a server-
side programming mechanism, called iterator, that helps increase performance by performing
large computing tasks directly on the servers and not on the client machine. By doing this, it
avoids sending large amounts of data across the network. Furthermore, it extends the Bigtable
data model, adding a new element to the key called “Column Visibility.” This element stores
a logical combination of security labels that must be satisfied at query time in order for the
key and value to be returned as part of a user request. This allows data with different security
requirements to be stored in the same table. As a consequence, users can see only those keys and
values for which they are authorized.
Cassandra is also inspired by Bigtable, thus sharing a lot of similarities with Accumulo and
HBase, although, unlike them, it relies on its own storage engine rather than on HDFS. Nevertheless, it has some distinctive features.
It extends the Bigtable data model by introducing supercolumns. A storage model with super-
columns looks like: {rowkey:{superkey:{columnkey:value}}}. Supercolumns can be either stored
based on the hash value of the supercolumn key or in sorted order. In addition, supercolumns
can be further nested. Cassandra natively supports secondary indices, which can improve data
access performance in columns whose values have a high level of repetition. Furthermore, it has
configurable consistency: both read and write consistency can be tuned per operation, by choosing
how many replicas must acknowledge it. Finally, Cassandra provides an SQL-like language, CQL, for interacting with the store.
SimpleDB is a non-relational data store provided by Amazon which focuses on high availabil-
ity (ensured through replication), flexibility, and scalability. SimpleDB supports a set of APIs
to query and store items in the database. A SimpleDB data store is organized in domains. Each
domain is a collection of items identified by their name. Each item contains one or more at-
tributes; an attribute has a name and a set of associated values. There is a one-to-one mapping
from SimpleDB’s data model to the one proposed by Bigtable shown in Figure 2.5. Domains
correspond to tables, items to rows, attributes to columns, and values to cells. The main op-
erations of SimpleDB API are the following (the respective delete/update operations are also
available).
• ListDomains() retrieves all the domains associated with one Amazon Web Services (AWS)
account.
• PutAttributes(D, k, (a,v)+) inserts or replaces attributes (a,v)+ into an item with name
k of a domain D. If the item specified does not exist, SimpleDB will create a new item.
It is not possible to execute an API operation across different domains, just as it is not possi-
ble to combine results from multiple tables in Bigtable. Therefore, if required, the aggregation of
results from API operations executed over different domains has to be done in the application
layer. AWS ensures that operations over different domains run in parallel. Hence, it is beneficial
to split the data in several domains in order to obtain maximum performance. As with most
non-relational databases, SimpleDB does not follow a strict transactional model based on locks
or timestamps. It only provides the simpler model of conditional puts, whereby fields can be
updated on the basis of the values of other fields; this allows implementing elementary
transactional models such as entry-level versions of optimistic concurrency control.
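The sketch below emulates this data model and the conditional-put behavior locally in plain Python; it is only an illustration of the semantics described above, not a client for the actual AWS API.

# A sketch of SimpleDB's data model (domains -> items -> multi-valued attributes)
# and of a conditional put; this emulates the described behavior locally and does
# not use the actual AWS API.
from collections import defaultdict

store = defaultdict(lambda: defaultdict(lambda: defaultdict(set)))

def put_attributes(domain, item, pairs):          # PutAttributes(D, k, (a,v)+)
    for attr, value in pairs:
        store[domain][item][attr].add(value)

def conditional_put(domain, item, attr, value, expected_attr, expected_value):
    # Apply the update only if the expected (attribute, value) pair currently holds.
    if expected_value in store[domain][item].get(expected_attr, set()):
        store[domain][item][attr] = {value}
        return True
    return False

put_attributes("artists", "picasso", [("name", "Pablo"), ("type", "cubist")])
conditional_put("artists", "picasso", "status", "exhibited", "type", "cubist")  # succeeds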
AWS imposes some size and cardinality limitations on SimpleDB. These limitations in-
clude:
• Number of domains: The default settings of an AWS account allow for at most 250 do-
mains. While it is possible to negotiate more, this has some overhead (one must discuss
with a sales representative, etc.; it is not as easy as reserving more resources through an
online form).
• Domain size: A domain cannot exceed 10 GB in size, nor hold more than 10^9 attributes.
• Item name length: The name of an item should not occupy more than 1024 bytes.
• Number of (attribute, value) pairs in an item: This cannot exceed 256. As a consequence, if
an item has only one attribute, that attribute cannot have more than 256 associated values.
DynamoDB was designed to provide seamless scalability and fast, predictable perfor-
mance. It runs on solid state disks (SSDs) for low-latency response times, and there is no limit
on the request capacity or storage size for a given table. This is because Amazon DynamoDB
automatically partitions the input data and workload over a sufficient number of servers to meet
the provided requirements. In contrast with its predecessor (SimpleDB), DynamoDB does not
automatically build indexes on item attributes leading to more efficient insert, delete, and up-
date operations, as well as improving the scalability of the system. Indexes can still be created if
requested.
• A map phase, where the input is divided into sub-inputs, each handled by a different map
task. The map task takes as input key/value pairs, processes them (by applying the opera-
tions defined by the user), and again outputs key/value pairs.
• A shuffle phase, where the key/value pairs emitted by the mappers are grouped and sorted
by key, and are then assigned to reducers.
• A reduce phase, where each reduce task receives key/value pairs (sharing the same key) and
applies further user-defined operations, writing the results into the file system.
To store inputs and outputs of MapReduce tasks, a distributed file system (e.g., GFS,
HDFS, S3) is typically used.
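The sketch below simulates the three phases in a single Python process, counting the triples per property of an N-Triples-like input; an actual deployment would express the same map and reduce functions through a framework such as Hadoop, with the shuffle performed by the framework itself.

# A single-process simulation of the map, shuffle, and reduce phases described
# above, counting the number of triples per property in an N-Triples-like input.
from itertools import groupby

def map_fn(line):                        # map: emit one (property, 1) pair per triple
    s, p, o = line.split()[:3]
    yield (p, 1)

def reduce_fn(key, values):              # reduce: aggregate all values sharing a key
    yield (key, sum(values))

def run_mapreduce(lines):
    pairs = [kv for line in lines for kv in map_fn(line)]
    pairs.sort(key=lambda kv: kv[0])     # shuffle: group and sort the pairs by key
    result = []
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        result.extend(reduce_fn(key, [v for _, v in group]))
    return result

lines = [":picasso :paints :guernica .", ":picasso :type :cubist .",
         ":rodin :creates :thethinker .", ":rodin :type :sculptor ."]
print(run_mapreduce(lines))              # [(':creates', 1), (':paints', 1), (':type', 2)]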
Many recent massively parallel data management systems leverage MapReduce in order
to build scalable query processors for both relational [Li et al., 2014] and RDF [Kaoudi and
Manolescu, 2015] data. The most popular open-source implementation of MapReduce is pro-
vided by the Apache Hadoop [Hadoop] framework, used by many RDF data management
platforms [Goasdoué et al., 2015, Huang et al., 2011, Husain et al., 2011, Kim et al., 2011, Lee
and Liu, 2013, Papailiou et al., 2012, 2013, 2014, Ravindra et al., 2011, Rohloff and Schantz,
2010, Schätzle et al., 2011, Wu et al., 2015].
Following the success of the MapReduce proposal, other systems and models have emerged,
which extend its expressive power and eliminate some of its shortcomings. Among the
most well-known frameworks are the Apache projects Flink (previously known as Strato-
sphere) [Alexandrov et al., 2014] and Spark [Zaharia et al., 2010].
Spark Spark achieves orders of magnitude better performance than Hadoop thanks to its
main-memory resilient distributed dataset (RDD) [Zaharia et al., 2012]. RDDs are parallel
data structures that let users persist intermediate results in memory and manipulate them us-
ing a rich set of operators. They provide fault tolerance via keeping lineage information. Spark’s
programming model extends the MapReduce model by also including traditional relational op-
erators (e.g., groupby, filter, join). Operations are either transformations or actions. Transforma-
tions (e.g., map, reduce, join) are performed in a lazy execution mode, i.e., they are not executed
until an action operator (e.g., collect, count) is called. Spark has gained a lot of popularity for
supporting interactive ad hoc batch analytics.
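The sketch below uses Spark's Python RDD API to evaluate a two-pattern join on the running example; the HDFS input path and the whitespace-separated triple encoding are assumptions made for illustration.

# A sketch of a two-pattern BGP evaluated with Spark RDDs: the patterns
# (?a :exhibited ?m) and (?m :located :paris) are joined on ?m. The HDFS path and
# the whitespace-separated triple encoding are illustrative assumptions.
from pyspark import SparkContext

sc = SparkContext(appName="rdf-bgp-sketch")
triples = sc.textFile("hdfs:///data/triples.nt").map(lambda line: tuple(line.split()[:3]))

exhibited = triples.filter(lambda t: t[1] == ":exhibited").map(lambda t: (t[2], t[0]))
located = triples.filter(lambda t: t[1] == ":located" and t[2] == ":paris").map(lambda t: (t[0], 1))

# The transformations above are lazy; the join and collect below trigger execution.
results = exhibited.join(located).map(lambda kv: kv[1][0]).collect()
print(results)                            # [':thethinker'] on the running example
sc.stop()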
Flink At the core of the Flink platform lies the PACT (Parallelization Contracts) parallel
computing model [Battré et al., 2010], which can be seen as a generalization of MapReduce.
Figure 2.6: Structure of a PACT operator: input data passes through a parallelization contract and a user function (UF), optionally accompanied by operator annotations and compiler hints.
PACT plans are built of implicitly parallel data processing operators that are optimized and trans-
lated into explicitly parallel data flows by the Flink platform. PACT operators manipulate records
made of several fields; each field can be either an atomic value or a list of records. Optionally,
the records in a given record multiset1 may have an associated key, consisting of a subset of the
atomic fields in those records.
The data input to/output by a PACT operator is stored in a (distributed) file. A PACT
plan is a directed acyclic graph (DAG, in short) of operators, where each operator may have
one or multiple inputs; this contrasts with the “linear” pattern of MapReduce programs, where
a single Reduce consumes the output of each Map. As Figure 2.6 shows, a PACT consists of a
parallelization contract, a user function (UF in short), and possibly some annotations and compiler
hints characterizing the UF behavior. The PACT parallelization contract describes how input
records are organized into groups (prior to the actual processing performed by the operator). A
simple possibility is to group the input records by the value(s) of some attribute(s), as customary
in MapReduce, but in PACT other choices are also possible. The user function is executed
independently over groups of records created by the parallelization contract; therefore these
executions can take place in parallel. Finally, annotations and/or compiler hints may be used to
enable optimizations (with no impact on the semantics), thus we do not discuss them further.
Although the PACT model allows creating custom parallelization contracts, a set of them
for the most common cases is built in:
• The Map contract has a single input and builds a singleton for each input record.
• The Reduce contract also has a single input; it groups together all records that share the
same key.
• The Cross contract builds the cartesian product of two inputs, that is: for each pair of
records, one from each input, it produces a group containing these two records.
• The Match contract builds all pairs of records from its two inputs, having the same key
value.
• The CoGroup contract can be seen as a “Reduce on two inputs;” it groups the records from
both inputs, sharing the same key value.
1 Similarly to SQL and SPARQL but differently from the classical relational algebra, PACT operates on bags (multisets)
of records rather than sets.
Observe that PACT operators provide a level of abstraction above MapReduce by manip-
ulating fine-granularity records, whereas MapReduce only distinguishes keys and values (but
does not model the structure which may exist within them). Further, separating the input con-
tract (which is only concerned with the grouping of input records) from the actual processing
applied in parallel by the operators enables on one hand flexible adaptation to parallel processing,
and on the other hand, smooth integration of user functions; this gives significant generality to
the PACT model.
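The sketch below illustrates the semantics of the Match and CoGroup contracts in plain Python: each builds the groups of records over which a user function would then run independently, and hence in parallel. It only mimics the contract semantics and is not the PACT/Flink API.

# A plain-Python illustration of the Match and CoGroup parallelization contracts:
# both organize the records of two keyed inputs into groups on which a user
# function (UF) can be applied independently, hence in parallel. This mimics the
# contract semantics only; it is not the PACT/Flink API.
from collections import defaultdict

def match(left, right):
    """Match: one group per pair of records, one from each input, sharing a key."""
    right_by_key = defaultdict(list)
    for k, v in right:
        right_by_key[k].append(v)
    return [((k, lv), (k, rv)) for k, lv in left for rv in right_by_key[k]]

def cogroup(left, right):
    """CoGroup: one group per key, holding all records of both inputs with that key."""
    groups = defaultdict(lambda: ([], []))
    for k, v in left:
        groups[k][0].append(v)
    for k, v in right:
        groups[k][1].append(v)
    return dict(groups)

created = [(":rodin", ":thethinker"), (":picasso", ":guernica")]
names = [(":rodin", "Auguste"), (":picasso", "Pablo")]
print(match(created, names))     # pairs of records sharing the same subject key
print(cogroup(created, names))   # one group per key over both inputs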
2.3 SUMMARY
We have presented all the necessary preliminaries for RDF and the cloud. Given these founda-
tions, the reader should be able to follow the next chapters.
In summary, RDF data is organized in triples of the form (s, p, o), stating that the subject
s has the property (a.k.a. predicate) p whose value is the object o. RDF data can also be seen
as graphs with s and o being nodes connected with a directed edge from s to o labeled with p.
RDF Schema defines classes as groups of entities and properties as relations among classes. Its
deductive rules lead to new information being inferred through a reasoning process. SPARQL
is a declarative query language for exploring and querying RDF data which mainly consists of a
set of triple patterns, i.e., triples which can contain variables.
To set the background for processing in the cloud we have introduced: (i) existing dis-
tributed file systems, such as HDFS and S3, (ii) the concept of distributed key-value stores and
some stores commonly found in the cloud, such as HBase and DynamoDB, and (iii) distributed
computation frameworks, such as MapReduce and Spark. All these components are used by the
state-of-the-art distributed RDF data management systems that we will detail in the following.
CHAPTER 3
Cloud-Based RDF Storage
1. Each system stores a given RDF graph in one or more fragments (or partitions).
According to the way in which the set of triples of each partition is defined, we distinguish:
• logical partitioning: each set can be described by a first-order logic query over the
graph triples and
• graph-based partitioning: sets are built by inspecting the connections among the
graph nodes.
2. Each RDF cloud platform relies on a concrete system for storing its RDF partitions; the
functionalities of the latter may be very advanced or on the contrary quite basic. The fol-
lowing alternatives have been investigated for the purpose of storing the data:
Figure 3.1 illustrates a taxonomy of the storage alternatives and the partitioning schemes
currently used in each one. Section 3.1 describes the two partitioning strategies, while Sec-
tions 3.2–3.6 describe the storage alternatives used for storing the data.
Definition 3.1 Storage description grammar. We describe data partitioning methods based
on a context-free grammar consisting of the following rules (where X denotes a terminal and
<X> is a non-terminal symbol):
<DSC> → <EXP> ({<EXP>})?
<EXP> → (<NAM>)? <ATT>+ | (<NAM>)? <M>(<ATT>+)
<M> → HP | CP | GP
<ATT> → S | P | O | U | T | C | G | S⃗ | P⃗ | O⃗ | U⃗ | T⃗ | C⃗ | G⃗ | *
<NAM> → [A-Z]+ | ε
We present the grammar elements in bottom-up fashion, starting from the simplest ele-
ments and gradually increasing complexity. Among the terminal symbols of the grammar, we
use S to denote the subject values, P for the property values, O for the object values, U for re-
source (any URI appearing as subject, property, or object in a graph), T for term (any URI or
literal appearing anywhere in the graph), C to denote a class appearing in the graph, and G to
refer to the name (or ID) of the graph itself. Further, curly brackets {, } and parentheses (, ) are
also terminal symbols of the grammar.
The core non-terminal symbol is <ATT>, which specifies an attribute stored within a
given storage structure. Its value ranges over the terminals S, P, O, U, T, C, G, to which we
add a set of corresponding symbols S⃗, P⃗, O⃗, U⃗, T⃗, C⃗, and G⃗. Each of these denotes that
data is stored sorted according to the value of the respective information item (S, P etc.) Further,
<ATT> may be simply specified using the asterisk (*) symbol to denote that the complete
content of a (sub-)graph is stored (instead of writing SPO or SPOG). We use * only when the
order in which data is stored is not important.
The non-terminal symbol <EXP> comprises:
• optionally, a storage structure name (<NAM>);
• optionally, a partitioning method specification (<M>), in which case we use parentheses
to determine the scope of the partitioning (described below); and
• one or several attribute specifications (<ATT>) as above.
The <M> symbol is used to describe the method used to distribute triples across the
distributed store. Two types of methods have been used:
• Hash Partitioning (denoted HP), where the stored tuples are split in partitions based on
the value of a hash function on some of their attributes;
• Capacity Partitioning (denoted CP), where partitions are formed solely with the goal of
fitting a given capacity on each site, without a specific policy on what data goes in each
partition; and
• Graph Partitioning (denoted GP), where partitions are formed based on the connections
among the graph nodes.
Sample productions matching the <EXP> non-terminal include: (i) SO or OS denote
an anonymous (unnamed) storage structure that stores (subject, object) pairs, (ii) Triples{SPO}
represents a data structure named Triples holding all triples of the dataset, (iii) S⃗O and OS⃗
both denote an anonymous storage structure comprising (subject, object) pairs, sorted in the
order determined by the subject, (iv) HP(SO) denotes a data structure storing the subject and
object of all triples from the graph, partitioned by the subject and object; there are as many
fragments in this data structure as there are distinct (subject, object) value combinations.
Note that in a combination of ATT symbols where several have overhead arrows, the order
of such symbols in the combination denotes the order according to which data is sorted in the respective
storage structure. For example, O⃗S⃗ denotes (subject, object) pairs sorted first by the object and
then by the subject, whereas S⃗O⃗ denotes a set of such pairs sorted first by the subject and then
by the object. Equivalently, we may also write an arrow spanning OS to denote the former, or an
arrow spanning SO to denote the latter.
Finally, <DSC> is a storage structure description. In its simple case, the storage structure
is a collection of items, in which case a single <EXP> is sufficient to describe it; or a map, in
which case it is specified by two <EXP> symbols, one for the key and one for the value of the
map. Note that the second <EXP> may in turn be either a collection or a map, thus allowing
for nested maps, or a map where a value is a collection of maps, etc.
3.2 STORING IN DISTRIBUTED FILE SYSTEMS
As discussed in Section 2.2.1, distributed file systems (DFS) are designed for providing scalable
and reliable data access in a cluster of commodity machines and, thus, they are a good fit for
storing large files of RDF data in the cloud. Large files are split into smaller chunks of data, dis-
tributed among the cluster machines and automatically replicated by the DFS for fault-tolerance
reasons. Since distributed file systems do not provide fine-grained data access, and selective
access to a piece of data can only be achieved at the granularity of a file, the RDF community
has recently paid more attention to distributed key-value stores for selective data access.
Still, there are systems that use DFS to store RDF data due to its ease of use. We classify
these systems according to the way they model the data as follows: (i) the triple model, which
preserves the triple construct of RDF, (ii) the vertical partitioning model, which splits RDF
triples based on their property, and (iii) the entity-based model, which uses a high-level entity
graph to partition the RDF triples.
The above vertical partitioning scheme is used by systems described in Husain et al. [2011],
Ravindra et al. [2011], Zhang et al. [2012]. It is also reminiscent of the vertical RDF partitioning
proposed in Abadi et al. [2007], Theoharis et al. [2005] for centralized RDF stores. Such a
storage layout can be described as HP(P){SO} if one considers (as is usually the case) that the value of
the property is used for a hash-based assignment of files to the cluster nodes.
Following the predicate-based partitioning scheme, all triples using the built-in special
RDF predicate :type are located in the same partition. However, because such triples appear
very frequently in RDF datasets, this leads to a single very big file containing such triples. In
addition, most triple patterns in SPARQL queries having :type as a predicate usually have the
object bound as well. This means that a search for atoms of the form (:x, :type, :t1), where :t1
is a concrete URI, in the partition corresponding to the :type predicate is likely to be inefficient.
For this reason, in HadoopRDF [Husain et al., 2011] the :type partition is further split based
on the object value of the triples. For example, the triple (:picasso, :type, :cubist) is stored
in a file named “type#cubist” which holds all the instances of class :cubist. This partitioning
is performance-wise interesting for the predicate :type because the object of such triples is an
RDFS class and the number of distinct classes appearing in the object position of :type triples
is typically moderate. Clearly, such a partitioning is less interesting for properties having a large
number of distinct object values, as this would lead to many small files in the DFS. In general,
having many small files in the DFS leads to a significant overhead at the server (the so-called
namenode), and thus should be avoided.
HadoopRDF [Husain et al., 2011] goes one step further by splitting triples that have the
same predicate based also on the RDFS class the object belongs to (if such information exists).
This can be determined by inspecting all the triples having predicate :type. For example, the
triple (:picasso, :paints, :guernica) would be stored in a file named “paints#artifact” because of
the triple (:guernica, :type, :artifact). This grouping helps for query minimization purposes,
as we shall see in Section 4.1.1.
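The sketch below reproduces this file layout in plain Python: triples are grouped by predicate, :type triples are split by their object (class), and other predicates are additionally split by the class of their object when it is known; the file names follow the conventions given above and are otherwise illustrative.

# A sketch of the HadoopRDF-style DFS layout described above: one group of triples
# per predicate, :type triples split further by their object (class), and other
# predicates split by the class of their object when that class is known.
from collections import defaultdict

def hadooprdf_layout(triples):
    type_of = {s: o for s, p, o in triples if p == ":type"}     # class of each resource
    files = defaultdict(list)
    for s, p, o in triples:
        if p == ":type":
            files["type#" + o.lstrip(":")].append((s, p, o))
        elif o in type_of:
            files[p.lstrip(":") + "#" + type_of[o].lstrip(":")].append((s, p, o))
        else:
            files[p.lstrip(":")].append((s, p, o))
    return files

triples = [(":picasso", ":type", ":cubist"), (":guernica", ":type", ":artifact"),
           (":picasso", ":paints", ":guernica"), (":picasso", ":name", "Pablo")]
for name, content in hadooprdf_layout(triples).items():
    print(name, content)        # e.g., "paints#artifact" holds (:picasso, :paints, :guernica)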
Predicate-based partitioning is clearly helpful in order to provide selective data access;
however, in distributed systems this partitioning cannot help with data locality issues. Triples
with the same predicate are guaranteed to be at the same node, but SPARQL queries typically
involve more than one predicate. Evaluating such queries thus requires shuffling large amounts of
data across the network. CliqueSquare [Goasdoué et al., 2015], inspired by works like Cai and
Frank [2004] which store RDF in P2P networks, minimizes network traffic by extending the
predicate-based partitioning of HadoopRDF with an additional hash-partitioning scheme as
follows.
First, CliqueSquare replicates the data three times, and each copy is partitioned in a dif-
ferent way: by the subject, the property, and the object, respectively. Each copy of the data is
distributed among the available machines by hashing their partitions, based on the values of
the subject, property, and object, respectively. Table 3.1 shows the triples of the running exam-
ple partitioned across three machines; the partition attribute appears underlined. Observe that
triples having the same value in the subject, property, or object position are guaranteed to be at
the same node.
Table 3.1: Hash partitioning of RDF triples in 3 nodes
Second, at each node, a separate file will be created for each predicate present on that
node, in a fashion reminiscent to that of HadoopRDF. For instance, continuing with our ex-
ample, node 1 will have a file named “1_name-S” holding the triple :picasso :name "Pablo", and
a file named “1_name-O” holding the triple :rodin :name "Auguste", instead of having a sin-
gle file “1_name” containing both triples. The extra splitting based on RDFS classes (used in
HadoopRDF) is also present in CliqueSquare. In Chapter 5 we shall explain how this additional
partitioning step can be exploited by the query optimizer to minimize data shuffling.
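The placement step can be sketched as follows: each triple is written three times, hashed on its subject, property, and object, respectively, so that triples sharing a value in a given position land on the same node; within each node, the per-predicate files named as above would then be created.

# A sketch of CliqueSquare-style placement: three copies of each triple, hashed on
# the subject, property, and object, respectively, so that triples sharing a value
# in that position are co-located. Within each node, the triples would then be
# written into per-predicate files such as "1_name-S" or "1_name-O".
NUM_NODES = 3

def place(triples, num_nodes=NUM_NODES):
    nodes = [[] for _ in range(num_nodes)]
    for s, p, o in triples:
        for position, value in (("S", s), ("P", p), ("O", o)):
            nodes[hash(value) % num_nodes].append((position, (s, p, o)))
    return nodes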
An alternative vertical partitioning model would be to partition triples based on their
subject or object value, instead of the property. However, this would lead to a high number of
very small files because of the number of distinct subject or object values appearing in RDF
datasets, which is to be avoided, as explained above. In addition, most SPARQL queries specify
the predicate in the triple patterns, while unspecified subject or object values are more common.
Thus, predicate-based partitioning makes selective data access (i.e., access to triples stored in a
specific relatively small file) more likely.
An extended version of the vertical partitioning model is proposed in S2RDF [Schätzle et
al., 2016], which is built on top of Spark. The basic idea is to precompute semi-join reductions for
each type of join (SS, OS, and SO) of all pairs of predicates. These smaller tables determine the
contents of a predicate file that are guaranteed to join with the contents of another predicate file.
For instance, using our running example (Figure 2.2), S2RDF creates a file “creates-name_SS”
which contains the pair :rodin :thethinker, which is the result of semi-joining creates with name
on the subjects. This reduces the size of input data for a query that needs to join the predicates
creates and name on the subjects. To reduce the storage overhead that naturally comes with this
scheme, S2RDF supports an optional selectivity threshold which determines when the reduction
of the predicate files pays off for the extended files to be materialized.
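A single such reduction can be sketched as follows; the pairs are kept in memory for simplicity, and the selectivity-threshold check is reduced to a size comparison.

# A sketch of one S2RDF-style semi-join reduction: "creates-name_SS" keeps the
# (s, o) pairs of the creates predicate whose subject also appears as a subject of
# the name predicate. The selectivity threshold is simplified to a size ratio.
def semijoin_ss(creates_pairs, name_pairs, threshold=1.0):
    name_subjects = {s for s, _ in name_pairs}
    reduced = [(s, o) for s, o in creates_pairs if s in name_subjects]
    # Materialize the reduced file only if it is small enough relative to the input.
    return reduced if len(reduced) <= threshold * len(creates_pairs) else None

creates = [(":rodin", ":thethinker")]
name = [(":rodin", "Auguste"), (":picasso", "Pablo")]
print(semijoin_ss(creates, name))   # [(':rodin', ':thethinker')], as in the example above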
3.3 STORING IN KEY-VALUE STORES
3.3.1 TRIPLE-BASED
RDF indexing has been thoroughly studied in a centralized setting [Neumann and Weikum,
2010a, Weiss et al., 2008a]. Given the nature of RDF triples, many common RDF indexes
store all the RDF dataset itself, eliminating the need for a “store” distinct from the index. Thus,
indexing RDF in a key-value store is mostly the same as storing it there.
As we have seen previously (in Section 2.2.2), key-value stores
typically offer four levels of information for structuring the data,
i.e., {tablename:{itemkey:{attributename:{attributevalue}}}}. This structure can be eas-
ily captured by the storage description grammar from Definition 3.1 as a four-level map. In
the following, we use this grammar to present the data layout of the various key-value based
systems.
Centralized systems use extensive indexing schemes using all possible permutations of
the subject, predicate, object of RDF triples to build indices. For example, Hexastore [Weiss et
al., 2008a] uses all 3! = 6 permutations of subject, predicate, object to build indices that pro-
vide fast data access for all possible triple patterns and for efficiently performing merge-joins.
RDF-3X [Neumann and Weikum, 2010a] additionally uses aggregated indices on subsets of
the subject, predicate, object, resulting in a total of 15 indices. The same indexing scheme has
recently migrated to the cloud with H2 RDF+ [Papailiou et al., 2013]. However, this extensive
indexing scheme has a significant storage overhead, which is amplified in a distributed environ-
ment (where data is replicated for fault tolerance). As a consequence, the majority of existing
systems (including the predecessor of H2 RDF+, H2 RDF [Papailiou et al., 2012]) use a more
conservative approach with only three indices. The three permutations massively used by to-
day’s systems are: subject-predicate-object (SPO), predicate-object-subject (POS), and object-
subject-property (OSP). Typical key-value RDF stores materialize each of these permutations
in a separate table (collection).
Depending on the specific capabilities of the underlying key-value store, different choices
have been made for the keys and values. For example, each one of S, P, O can be used as the key,
attribute name, attribute value of the key-value store, or a concatenation of two or even three
of the elements can be used as the key. One of the criteria to decide on the design is whether
the key-value store offers a hash-based or sorted index. In the first case, only exact lookups
are possible and, thus, each of the triple’s elements should be used as the key. In the latter case,
combinations of the triple’s elements can be used, since prefix lookup queries can be answered.
Table 3.2: SPO index in a key-value store with a hash index (top) and a sorted index (bottom)

Hash index:
Item Key        (attr. name, attr. value)
:picasso        (:paints, :guernica), (:name, "Pablo"), (:type, :cubist)
:rodin          (:creates, :thethinker), (:name, "Auguste"), (:type, :sculptor)
:guernica       (:exhibited, :reinasofia), (:type, :artifact)
:thethinker     (:exhibited, :museerodin)
:reinasofia     (:located, :madrid)
:museerodin     (:located, :paris)
:cubist         (:sc, :painter)
:sculptor       (:sc, :artist)
:painter        (:sc, :artist)

Sorted index:
Item Key                                (attr. name, attr. value)
:picasso :paints :guernica              (-, -)
:picasso :name "Pablo"                  (-, -)
:picasso :type :cubist                  (-, -)
:rodin :creates :thethinker             (-, -)
:rodin :name "Auguste"                  (-, -)
:rodin :type :sculptor                  (-, -)
:guernica :exhibited :reinasofia        (-, -)
:guernica :type :artifact               (-, -)
:thethinker :exhibited :museerodin      (-, -)
:reinasofia :located :madrid            (-, -)
:museerodin :located :paris             (-, -)
:cubist :sc :painter                    (-, -)
:sculptor :sc :artist                   (-, -)
:painter :sc :artist                    (-, -)
The left part of Table 3.2 shows a possible design for the SPO index in a hash-based
key-value store using the RDF running example of Figure 2.2. Subjects are used as the keys,
predicates are used as the attribute names, and objects as the attribute values. Using our storage
description grammar, this index is described as follows: SPO{S{P{O}}}. Similar POS and OSP
indices are constructed. When the key-value store offers a sorted index on the key, any concate-
nation of the triples’ elements can be used as the key. In the right part of Table 3.2, we show for
the SPO index the extreme case where the concatenation of all three triples’ elements are used
as the key, while the attribute names and values are empty. The storage descriptor for this index
is S⃗P⃗O⃗.
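The sketch below mimics the hash-indexed SPO{S{P{O}}} layout of Table 3.2 with nested maps and shows a lookup for triple patterns whose subject is bound; a sorted-index variant would instead store the concatenated "s p o" string as the key and rely on prefix scans.

# A sketch of the SPO{S{P{O}}} layout (hash-indexed side of Table 3.2) as nested
# maps, with a lookup for triple patterns whose subject is bound. A sorted-index
# variant would instead use the concatenated "s p o" string as the key and answer
# such patterns with prefix scans.
from collections import defaultdict

spo = defaultdict(lambda: defaultdict(set))      # subject -> predicate -> set of objects

def insert(s, p, o):
    spo[s][p].add(o)

def lookup(s, p=None, o=None):
    """Yield the triples matching a pattern whose subject is bound (p, o optional)."""
    for pred, objects in spo[s].items():
        if p is None or pred == p:
            for obj in objects:
                if o is None or obj == o:
                    yield (s, pred, obj)

insert(":picasso", ":paints", ":guernica")
insert(":picasso", ":type", ":cubist")
print(list(lookup(":picasso")))                  # all triples about :picasso
print(list(lookup(":picasso", ":type")))         # [(':picasso', ':type', ':cubist')]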
Representative systems using key-value stores as their underlying RDF storage facility
include Rya [Punnoose et al., 2012], which uses Apache Accumulo [Accumulo]; Cumulus-
RDF [Ladwig and Harth, 2011] based on Apache Cassandra [Cassandra]; Stratustore [Stein
and Zacharias, 2010b], relying on Amazon’s SimpleDB [Amazon Web Services]; H2 RDF [Pa-
pailiou et al., 2012]; H2 RDF+ [Papailiou et al., 2013], built on top of HBase [HBase]; and
AMADA [Aranda-Andújar et al., 2012] which uses Amazon’s DynamoDB [Dynamo].
Table 3.3 outlines for each system the key-value store used, the type of index data structure
provided by the key-value store, and the storage description. Observe that in some cases some
data positions (attribute names and/or attribute values) within the key-value stores are left empty.
Table 3.3: Indices used by RDF systems based on key-value stores
An important issue in such approaches is the high skewness of the property values,
i.e., some property values appear very frequently in RDF datasets. In this case, a storage
scheme using the property as the key leads to a table with few, but very large, rows. Because
in HBase all the data of a row is stored on the same machine, machines holding very
popular property values may run out of disk space and further cause bottlenecks when processing
queries. For this reason, MAPSIN completely discards the POS index, although techniques for
handling such skew, e.g., splitting a long list across machines [Abiteboul et al., 2008],
are quite well understood by now. To handle property skew, CumulusRDF, built on Cassandra,
builds a different POS index. The POS index key is made of the property and object, while the
subject is used for the attribute name. Further, another attribute, named P, holds the property
value. The secondary index provided by Cassandra, which maps values to keys, is used to retrieve
the associated property-object entries for a given property value. This solution prevents overly
large property-driven entries, all the while preserving selective access for a given property value.
Notice that H2RDF and H2RDF+ avoid the skew problem by using only the key of HBase;
although they partially lose the locality of some triples, they benefit from a better distribution
of rows among machines.
Figure 3.2: A possible partitioning of the RDF graph of Figure 2.2 over a two-machine cluster (node 1 and node 2).
3.3.2 GRAPH-BASED
Graph-oriented storage for RDF data is less studied by the research community but has been gaining
more attention lately. Unsurprisingly, the proliferation of distributed key-value stores has laid
the groundwork for developing more graph-oriented RDF stores in the cloud.
Trinity.RDF [Shao et al., 2013], proposed by Microsoft, is the first system adopting a
graph-oriented approach for RDF data management using cloud infrastructures. It uses as its
underlying storage and indexing mechanism Trinity [Shao et al., 2013], a distributed graph
system built on top of a memory key-value store. An RDF graph is distributed among the
cluster machines by hashing the values of the nodes of the RDF graph (subjects and objects of
RDF triples); thus, each machine holds a disjoint part of the RDF graph. Figure 3.2 depicts a
possible partitioning of the RDF graph of Figure 2.2 in a cluster of two machines.
In Trinity.RDF, the RDF graph (where resources are nodes and triples are edges) is split
into partitions, where each resource is assigned to a single machine. However, because some
edges inevitably cross the partitioning, each machine also stores such edges outgoing from
the resources assigned to that machine; in other words, these are triples whose subject URI is
part of that machine's partition, while the object URI is not.
Within each machine, the URIs (or constants) labeling the RDF graph nodes are used
as keys, while two adjacency lists are stored as values: one for the incoming edges and another
for the outgoing ones. Using our storage description, this corresponds to an S{PO} and an O{PS}
index, respectively. Since these two indices do not allow retrieving triples by the value of their
property, e.g., to match (?x, :name, ?y), Trinity.RDF maintains a separate index whose keys
are the property values and whose values are two lists, one with the subjects and another with
the objects of each property. This approach amounts to storing the indices P{S} and P{O}, in
our notation.
Table 3.4: Indexing in Trinity.RDF on a two-machine cluster: the key-value store at machine 1
(left) and at machine 2 (right).
Table 3.4 illustrates the RDF graph partitioning of Figure 3.2 in the key-value store of
Trinity. Each node in the graph is stored as a key in one of the two machines together with its
incoming and outgoing edges; graph navigation is implemented by lookups on these nodes. The
example shown in Table 3.4 is the simplest form of RDF indexing in Trinity, where IN and OUT
denote the lists of incoming and outgoing edges, respectively. In Shao et al. [2013], the authors
propose a further modification of this storage model aiming at decreasing communication costs
during query processing. This is achieved by partitioning the adjacency lists of a graph node by
machine so that only one message is sent to each machine regardless of the number of neighbors
of that node.
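A minimal sketch of this layout follows (hypothetical Python, not Trinity's actual data structures or API): graph nodes are hash-partitioned over two "machines" modeled as dictionaries, each node stores IN and OUT adjacency lists (the O{PS} and S{PO} indices), and a per-machine property index plays the role of P{S} and P{O}.

    NUM_MACHINES = 2
    machines = [dict() for _ in range(NUM_MACHINES)]     # node -> {"IN": [...], "OUT": [...]}
    prop_index = [dict() for _ in range(NUM_MACHINES)]   # property -> {"S": [...], "O": [...]}

    def machine_of(node):
        return hash(node) % NUM_MACHINES                 # hash partitioning of graph nodes

    def insert(s, p, o):
        m_s, m_o = machine_of(s), machine_of(o)
        machines[m_s].setdefault(s, {"IN": [], "OUT": []})["OUT"].append((p, o))   # S{PO}
        machines[m_o].setdefault(o, {"IN": [], "OUT": []})["IN"].append((p, s))    # O{PS}
        prop_index[m_s].setdefault(p, {"S": [], "O": []})["S"].append(s)           # P{S}
        prop_index[m_o].setdefault(p, {"S": [], "O": []})["O"].append(o)           # P{O}

    for t in [(":picasso", ":paints", ":guernica"),
              (":guernica", ":exhibited", ":reinasofia")]:
        insert(*t)

    # Graph navigation from :picasso follows its OUT list on whichever machine stores it.
    print(machines[machine_of(":picasso")][":picasso"]["OUT"])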
Stylus [He et al., 2017] is a more recent RDF store built on top of Trinity [Shao et al.,
2013]. The RDF graph is partitioned over the compute nodes using random hashing. The par-
ticularity of Stylus is its compact storage scheme based on templates. These templates are in-
spired by the characteristic sets proposed in Neumann and Moerkotte [2011], where the authors
made the observation that in most RDF graphs the property edges exhibit a certain structure,
e.g., :painters tend to have a :name and :paints property. Stylus uses a compact data structure
based on user-defined types, called xUDT, to represent such groups of properties and the RDF triples.
Table 3.5: Indexing in Stylus
Each group of properties {:p1, :p2, ..., :pn} is stored in the key-value store as (gid, {:p1,
:p2, ..., :pn}), where gid is an identifier of the group. Then, each RDF subject id having the
group of properties gid is stored as (id, {gid, offsets, obj_vals}), where obj_vals is the
list of object values of the triples having subject id, and offsets is a list of integers specifying
the offset of the object values for each property of the current subject. Table 3.5 shows an
example of the Stylus storage scheme for our running example. The table on the left keeps the groups
of properties, while the table on the right keeps the information for the RDF subjects.
The Stylus storage scheme not only provides efficient data access but can also reduce the number
of joins required to answer a query. Being based on a main-memory key-value store, it achieves
great performance speed-ups. This comes at the cost of longer loading times for the input data.
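The sketch below is a simplified, assumed rendering of this scheme (the names gid, offsets, and obj_vals follow the notation above, not Stylus' actual code): one record per distinct property group and one compact record per subject, with offsets used to slice a flat list of object values.

    templates = {}   # frozenset of properties -> gid
    groups = {}      # gid -> ordered list of properties
    subjects = {}    # subject id -> (gid, offsets, obj_vals)

    def store_subject(s, prop_to_objects):
        props = tuple(sorted(prop_to_objects))
        gid = templates.setdefault(frozenset(props), len(templates))
        groups.setdefault(gid, list(props))
        obj_vals, offsets = [], []
        for p in groups[gid]:
            offsets.append(len(obj_vals))        # where this property's objects start
            obj_vals.extend(prop_to_objects[p])
        subjects[s] = (gid, offsets, obj_vals)

    store_subject(":picasso", {":paints": [":guernica"],
                               ":name": ['"Pablo"'],
                               ":type": [":cubist"]})

    def objects_of(s, p):
        gid, offsets, obj_vals = subjects[s]
        props = groups[gid]
        i = props.index(p)
        end = offsets[i + 1] if i + 1 < len(offsets) else len(obj_vals)
        return obj_vals[offsets[i]:end]

    print(objects_of(":picasso", ":paints"))     # [':guernica']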
Figure 3.3: The example RDF graph partitioned into four partitions (partition 1 to partition 4).
2 The exact definition is a bit more involved, allowing for directed cycles to occur in specific places with respect to the
end-to-end path, in particular at the end; a sample end-to-end path is u1 → u2 → u3 → u4 → u5 → u3, for some URIs ui in the RDF
graph, 1 ≤ i ≤ 5.
3.4 STORING IN MULTIPLE CENTRALIZED RDF STORES
problem is NP-Hard, thus they propose an approximate algorithm for solving it. Similar to
SHAPE and H-RDF-3X, they store the derived partitions in local RDF-3X stores.
In Wu et al. [2012], HadoopDB [Abouzeid et al., 2009] is used, where RDF triples are
stored in a DBMS at each node. Triples are stored using the vertical partitioning proposed
in Abadi et al. [2007], i.e., one table with two columns per property. Triple placement is de-
cided by hashing the subjects and/or objects of the triples according to how the properties are
connected in the RDF Schema. If the RDF Schema is not available, a schema is built by ana-
lyzing the data at loading time.
DREAM [Hammoud et al., 2015] proposes to replicate the entire RDF graph to all compute
nodes, under the assumption that the graph fits at each node. The benefits of this
approach are that there is no preprocessing (partitioning) cost when storing the data and that
communication costs are reduced. DREAM uses RDF-3X at each node to store the RDF graph.
Finally, gStoreD [Peng et al., 2016b] stores each partition of the RDF graph in a local
gStore instance. gStore [Zou et al., 2011] is a centralized graph-based RDF store which uses
the typical adjacency list representation for storing RDF graphs backed by structural indices
capable of pruning large parts of the dataset. gStoreD is oblivious to the partitioning strategy
of the RDF graph and, thus, one can partition the RDF graph in any way. This is thanks to
their “partial evaluation and assembly” query evaluation strategy, as we will discuss in the next
chapter.
Workload-Based
There are a few works that take a query workload into consideration for determining the best
partitioning strategy.
WARP [Hose and Schenkel, 2013a] extends the partitioning and replication techniques
of Huang et al. [2011] to take into account a query workload in order to choose the parts of
RDF data that need to be replicated. Thus, rarely used RDF data does not need to be replicated,
leading to a reduced storage overhead compared to the one of Huang et al. [2011].
Partout [Galarraga et al., 2012] is a distributed RDF engine also concerned with partition-
ing RDF, inspired by a query workload, so that queries are processed over a minimum number
of nodes. The partitioning process is split into two tasks: fragmentation, i.e., splitting the triples into
pieces, and fragment allocation, i.e., deciding at which node each piece will be stored. The fragmentation
is a horizontal partitioning of the triple relation based on the set of constants appearing
in the queries, while the fragment allocation is done so that most queries can be executed
locally at one host while at the same time maintaining load balancing. This is done by considering,
for each fragment, the number of queries that need to access it, as well as the number of triples
required by each such query. The triple partitioning process takes into consideration load imbalances
and space constraints to assign the partitions to the available nodes.
In Peng et al. [2016a], the authors first mine patterns with high access frequencies from the workload.
Then, based on these frequent access patterns, they propose two different partitioning
policies: one that stores the RDF subgraph that matches a frequent access pattern as a
whole in one compute node (vertical partitioning) and one that splits the RDF subgraph that
matches a frequent access pattern and stores it in different compute nodes (horizontal partition-
ing). In the vertical partitioning, when a query that contains a frequent access pattern arrives,
it will typically be evaluated locally on only one compute node. In this way, multiple compute
nodes can be used to evaluate multiple queries simultaneously. While the vertical partitioning
policy improves query response time as well as query throughput due to its local evaluation, the
horizontal partitioning can improve query response time due to its high parallelism.
• attribute-based indexing, which uses three indices: S{G}, P{G}, O{G}; and
• attribute-subset indexing, which uses seven indices: S{G}, P{G}, O{G}, SP{G}, PO{G},
SO{G}, SPO{G}.
The indices outlined above enable efficient query processing in AMADA by routing
incoming queries (only) to those graphs that may have useful results, as we explain in Chapter 4.
The selected datasets are then loaded into a centralized RDF store.
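As an illustration of the attribute-subset strategy, the following sketch (a toy assumption, not AMADA's implementation; graph identifiers such as G1 are made up) maps every subject, property, object, and combination thereof to the set of graphs containing it, so that a query's constants can be looked up to select candidate graphs.

    from itertools import combinations

    index = {}   # key such as ("S", ":picasso") or ("SP", ":picasso", ":paints") -> set of graph ids

    def index_triple(graph_id, s, p, o):
        positions = {"S": s, "P": p, "O": o}
        for r in (1, 2, 3):                              # singletons, pairs, and the full triple
            for combo in combinations("SPO", r):
                key = ("".join(combo),) + tuple(positions[c] for c in combo)
                index.setdefault(key, set()).add(graph_id)

    index_triple("G1", ":picasso", ":paints", ":guernica")
    index_triple("G2", ":rodin", ":creates", ":thethinker")

    # A triple pattern (?x, :paints, ?y) is routed only to graphs containing :paints.
    print(index[("P", ":paints")])                       # {'G1'}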
3.7 SUMMARY
Table 3.6 summarizes the back-end and storage layout schemes used by the RDF stores. For
each, we also present the benefit of its chosen storage scheme, outlining three classes of back-
ends used for storing RDF triples: distributed file systems, key-value stores, and centralized
RDF stores.
We observe that almost all systems based on a key-value store adopt a three-indices
scheme, and specifically the indices SPO, POS, and OSP. Only Trinity.RDF [Shao et al., 2013]
adopts a slightly different indexing scheme, with a second-level index as described above. Finally,
Table 3.6: Comparison of storage schemes. Storage systems used: DFS = distributed file system,
KVS = key-value store, CS = centralized store, MS = memory store.
CHAPTER 4
Cloud-Based SPARQL Query Processing
Hybrid Approaches
An alternative approach is presented in AMADA [Aranda-Andújar et al., 2012, Bugiotti et
al., 2012, 2014], where data resides in a cloud store and indices are kept in a key-value store.
Query processing is achieved by selecting a (hopefully tight) superset of the RDF datasets which
contain answers to a given query. This is done by consulting the available indices (see Section 3.6).
Then, the selected RDF datasets are loaded at query time into a centralized state-of-the-art RDF
store, which computes the answer to the query.
1 Even though TensorRDF is not explicitly relational-based, its tensor calculus operations naturally match the relational
model operations.
SELECT ?y ?z WHERE { ?x :type :artist . ?x :paints ?y . ?y :exhibited ?z . }
Job #1: scan(*.rdf), select(p=:type, o=:artist). Job #2: scan(*.rdf), select(p=:paints), join on ?x. Job #3: scan(*.rdf), select(p=:exhibited), join on ?y. Job #4: remove duplicates, project(?y, ?z).
Figure 4.3: MapReduce join evaluation in SHARD based on Rohloff and Schantz [2010].
In this section, we focus on the join methods that the systems use, i.e., the physical operators. We discuss the order in which these joins can be executed in the
next chapter.
Join evaluation is a challenging problem in cloud-based RDF systems because neither
key-value stores nor MapReduce inherently support joins. For this reason, very early works on
indexing RDF in key-value stores do not handle joins; for instance, CumulusRDF [Ladwig and
Harth, 2011] supports only single triple-pattern queries. Among the remaining systems,
a large number of works use MapReduce to evaluate joins. For this reason, in
the following, we distinguish between works that use MapReduce-based joins and works that
perform join evaluation outside of MapReduce, often at a single site.
MapReduce-Based Joins
The first system to use MapReduce for SPARQL query evaluation is SHARD [Rohloff and
Schantz, 2010]. In SHARD, one MapReduce job is initialized for each triple pattern. In ad-
dition, a last job is created for removing redundant results (if necessary) and outputting the
corresponding values for the variables that need to be projected. In the map phase of each job,
the triples that match the triple pattern are sent to the reducers. In the reduce phase, the matched
triples are joined with the intermediate results of the previous triple patterns (if any). Figure 4.3
illustrates the join evaluation using a SPARQL query with three of the triple patterns of the
running example of Figure 4.1. Conceptually, SHARD's query evaluation strategy leads to left-deep
query plans, which are often used in traditional centralized databases. We discuss
different query plans in the next chapter. Note that for a three-triple-pattern query, four jobs
are required, while all data is scanned three times, which incurs significant I/O overhead. Subsequent
works aim at reducing this overhead.
In Schätzle et al. [2011] the authors propose a mapping from full SPARQL 1.0 to Pig
Latin [Olston et al., 2008], a higher-level language on top of MapReduce which provides
primitives such as filters, joins, and unions. Each triple pattern is transformed into a Pig Latin
filter operation, while the triple patterns are successively joined one after the other; left-outer
joins and unions are also used for more complex SPARQL queries, such as those with OPTIONAL
and UNION expressions. The way the joins are implemented, and any other implementation
issue, is left to the Apache Pig system.2
Figure 4.4: Hash-based repartition join in MapReduce of relations R(?x, ?y) and S(?y, ?z) on variable ?y, with map, shuffle, and reduce phases.
HadoopRDF uses a hash-based repartition join [Blanas et al., 2010]. Figure 4.4 illustrates
this type of join on two relations R and S. The columns are variables of a SPARQL query and
the join is on variable ?y. CliqueSquare uses the hash-based repartition join when joining intermediate
results. When joining triple patterns, data is co-partitioned thanks to its partitioning
scheme; for this reason, CliqueSquare uses a map join where hash joins happen locally in the
map phase.
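The essence of this repartition join can be sketched as follows (a toy in-memory simulation of the map, shuffle, and reduce phases of Figure 4.4, not Hadoop code): mappers tag each tuple with its relation of origin and emit the join variable as the key, the shuffle groups tuples by key, and reducers combine the two groups.

    from collections import defaultdict

    R = [("a1", "b1"), ("a2", "b2"), ("a3", "b2")]       # bindings for (?x, ?y)
    S = [("b1", "c1"), ("b2", "c2"), ("b4", "c2")]       # bindings for (?y, ?z)

    def map_phase():
        for x, y in R:
            yield y, ("R", x)          # key = join variable ?y
        for y, z in S:
            yield y, ("S", z)

    def shuffle(pairs):
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups

    def reduce_phase(groups):
        for y, values in groups.items():
            xs = [v for tag, v in values if tag == "R"]
            zs = [v for tag, v in values if tag == "S"]
            for x in xs:
                for z in zs:
                    yield x, y, z      # joined binding (?x, ?y, ?z)

    print(sorted(reduce_phase(shuffle(map_phase()))))
    # [('a1', 'b1', 'c1'), ('a2', 'b2', 'c2'), ('a3', 'b2', 'c2')]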
Although H2RDF+ stores the RDF data in a key-value store, it uses MapReduce query
plans to evaluate queries which have been predicted to be non-selective and thus benefit from
parallelizing the processing of a large amount of data. H2RDF+ [Papailiou et al., 2013] takes
advantage of the indices kept in HBase and, when joining triple patterns, it uses a merge join
with a map-only job.
2 https://pig.apache.org/
When joining a triple pattern with intermediate results (which are not
sorted), it uses a sort-merge join. In case both relations to be joined are intermediate results, it
falls back to the hash-based repartition join. In Przyjaciel-Zablocki et al. [2012], a Map-side
index nested loops algorithm is proposed for joining triple patterns. The join between two triple
patterns is computed in the Map-phase of a job by retrieving values that match the first triple
pattern from the key-value store, injecting each of these values into the second triple pattern and
performing the corresponding lookup for the second pattern in the key-value store. No shuffle or
reduce phases are required. As an optimization, the triple patterns that share a common variable
on the subject (or object) are grouped and evaluated at once in a single map phase. In Zhang et
al. [2012], Bloom filter joins are executed whenever one of the inputs is small or selective enough
that it can be used to reduce the data from the other inputs.
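A minimal sketch of such a map-side index nested loop join follows (the dictionary kv and its P{S{O}} layout are assumptions standing in for the key-value store): for each match of the first pattern, the shared variable's binding is injected into the second pattern and looked up directly, so no shuffle or reduce phase is needed.

    # A toy "key-value store" indexed by property, then by subject: P{S{O}}.
    kv = {
        ":paints":    {":picasso": [":guernica"]},
        ":exhibited": {":guernica": [":reinasofia"]},
    }

    def map_task(pattern1_prop, pattern2_prop):
        results = []
        # Matches of the first triple pattern (?x pattern1_prop ?y).
        for x, ys in kv.get(pattern1_prop, {}).items():
            for y in ys:
                # Inject ?y into the second pattern (?y pattern2_prop ?z) and look it up.
                for z in kv.get(pattern2_prop, {}).get(y, []):
                    results.append((x, y, z))
        return results

    print(map_task(":paints", ":exhibited"))
    # [(':picasso', ':guernica', ':reinasofia')]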
Note that there has been extensive research on different types of relational joins (theta-
joins, equi-joins, similarity joins, etc.) and how they can be implemented in MapReduce. Among
the most popular techniques is the repartition join, where both relations get partitioned and
joined in different nodes, and the broadcast join, where the smallest relation is broadcast to all
nodes and joined with the partitions of the larger relation. The interested reader may refer to
the survey in Li et al. [2014] for more information.
sent to node 1 to determine the nodes holding the object values :picasso; in our example, these
are nodes 1 and 2. From node 1 we retrieve the object value :guernica, while a message is sent
to node 2, where the object value ``Pablo" is found.
If neither the subject nor the object of the triple pattern is a constant, the POS index is used
to find matches for the given property. Graph exploration starts in parallel from each subject
(or object) found, filtering out the target values that do not match the property value. If the
predicate is also unbound, then every predicate is retrieved from the POS index. For example,
assume again Figure 3.2 and the triple pattern (?s, :name, ?o). From the POS index (see Table 3.4),
two subject values are found: :picasso and :rodin. For both of these values, the above
procedure is followed again; moreover, the object values :guernica and :thethinker are filtered out
because the property value does not match.
Similarly, Spartex [Abdelaziz et al., 2017b] utilizes its two per-vertex index structures
(PS and PO). Given a property (edge), all nodes of the graph check their indices to extract the
subjects or objects of triples having this property value.
Join Evaluation
Join evaluation in most graph-based systems is done through graph exploration. For conjunctive
queries, triple patterns are processed sequentially: one triple pattern is picked as the root and a
chain of triple patterns is formed. The results of each triple pattern guide the exploration of the
graph for the next one. This avoids manipulating triples that match one triple pattern but
cannot be joined with the matches of a pattern considered previously during query evaluation. Figure 4.5
illustrates how the query of Figure 4.1 is processed in Trinity.RDF, assuming the given triple
pattern evaluation order. Spartex works similarly when joining the triple patterns.
This graph exploration resembles the index nested loop join algorithm also used in
Rya [Crainiceanu et al., 2012] and H2RDF [Papailiou et al., 2012], with the difference that
only the matches of the immediate neighbors of a triple pattern are kept, and not the whole history
of the triple patterns' matches during the graph exploration; thus, invalid bindings may be included
in the results. For this reason, a final join is required at the end of the process to remove
any invalid results that have not been pruned through the graph exploration. This join typically
involves a negligible overhead.
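The following sketch (a simplified, single-machine rendering with an assumed adjacency dictionary, not Trinity.RDF's distributed implementation) explores the graph along a chain of triple patterns similar to the query of Figure 4.1, keeping at each step only the bindings reachable from the previous one; in the real systems, a final join over the accumulated bindings removes any remaining invalid combinations.

    out_edges = {                          # node -> list of (property, object)
        ":picasso":  [(":type", ":artist"), (":paints", ":guernica")],
        ":guernica": [(":exhibited", ":reinasofia")],
    }

    def explore(pattern_chain):
        # Start from nodes matching (?x :type :artist), then follow each property in turn.
        frontier = [n for n, edges in out_edges.items() if (":type", ":artist") in edges]
        bindings = [(n,) for n in frontier]
        for prop in pattern_chain:
            new_bindings = []
            for b in bindings:
                for p, o in out_edges.get(b[-1], []):
                    if p == prop:                      # follow only edges matching the pattern
                        new_bindings.append(b + (o,))
            bindings = new_bindings
        return bindings

    print(explore([":paints", ":exhibited"]))
    # [(':picasso', ':guernica', ':reinasofia')]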
A different approach is followed in Peng et al. [2016b] and Stylus [He et al., 2017]: they
use a "partial evaluation and assembly" strategy, which sends the entire query plan to all compute nodes;
each node executes the plan and sends back the partial results to the coordinator; finally, the
coordinator gathers all partial results and computes the final result.
4.3 SUMMARY
Table 4.2 summarizes the query processing strategy of each of the presented systems. In contrast
to the storage-based categorization of the existing systems, where each system fits in only one category,
the query-based classification is less clear-cut, with some systems belonging to more than one
class. For example, H2RDF uses both MapReduce-based query evaluation and local evaluation
depending on the query selectivity, while Huang et al. [2011] use a hybrid approach
between MapReduce and evaluation on multiple centralized RDF stores. We view this diversity
as proof of the current interest in exploring various methods, and their combinations, for
massively distributed RDF query processing.
Table 4.2: Comparison of query processing strategies (Continues.)
CHAPTER 5
SPARQL Query Optimization for the Cloud
Figure 5.1: (a) A left-deep query plan for the query of Figure 4.1; (b) binary vs. n-ary joins in SPARQL queries.
Node lo_i is a parent of node lo_j iff the output of lo_i is an input of lo_j. Often, but not
always, plans are in fact trees, that is, each node has exactly one parent. Figure 5.1 shows several
sample query plans for the query in Figure 4.1.
Query plans can be: (i) linear trees, where one child of each join operator is a leaf; without
loss of generality, considering that the non-leaf operator is always the first child of its parent,
linear trees can be reduced to left-deep ones; or (ii) bushy trees, where two or more children of an
operator can be joins. All the plans in Figure 5.1 are linear. Following a popular heuristic intro-
duced in centralized database management systems, most works on SPARQL query processing
only consider left-deep plans, which leads to a significant reduction in the search space. For a
query over m triple patterns, there are m! possible linear plans, whereas there are O(m^3) bushy
ones. In contrast, some recent systems explore both linear and bushy plans; the main interest
of the latter is to exploit the parallel computing potential of a MapReduce environment. Join
operators at the same level in a bushy plan can be executed in parallel.
Another dimension that distinguishes the different works is the use of binary joins only vs.
n-ary joins for some n > 2. SPARQL queries where a variable is shared by more than two triple
patterns are quite frequent. In these cases, instead of joining the triple patterns pairwise, one
can use an n-way join (with n > 2) to join all these triple patterns at the same time. Figure 5.1
illustrates the difference using a SPARQL query containing three triple patterns which have ?x as
a common variable. The interest of n-ary join operators is to compute in a single operation
the result of joins across more than two relations. Traditional query evaluation algorithms were
developed at a time when the available memory was often the bottleneck, and evaluating several
join conditions simultaneously (in a single operator) was not an interesting option, since it would
have required more memory to hold the necessary data structures. This limitation no longer
applies in modern environments, and in particular to the parallel join algorithms available in our
setting.
Figure 5.2: Triple pattern (?x :paints ?y) is used in two join operators.
Finally, a dimension which impacts the search space is whether a node in the plan (either
a triple scan or a join operator) can be input to only one, or to several other nodes (typically joins).
When each node has at most one parent, the plan is a tree; otherwise, it is a DAG. Using the result
of a triple pattern or of a join in more than one join operator in a DAG plan is interesting when
the reused result is highly selective with respect to its join partners. Figure 5.2 shows an example
where triple pattern (?x :paints ?y) is used for joining with triple pattern (?x :type :artist)
on ?x and with triple pattern (?y :exhibited ?z) on variable ?y. Such a query plan is beneficial
when there are only few triples with property :paints, since it decreases the size of
both input relations of the final join.
Early works, such as SHARD [Rohloff and Schantz, 2011], use binary left-deep trees to
evaluate a SPARQL query. In this case, one MapReduce job is required per triple pattern (as
discussed in Section 4.1.2). This leads to n + 1 jobs for a query with n triple patterns, which results
in poor performance.
To overcome the inefficiency of early works stemming from significant overhead of a
MapReduce job [Condie et al., 2010], more recent MapReduce-based proposals use the heuris-
tic of producing query plans that require the least number of jobs. Although traditional selectivity-
based optimization techniques may decrease the intermediary results, they also lead to a growth
in the number of jobs, and thus, to worse query plans with respect to query response time. There-
fore, the ultimate goal of such proposals is to produce query plans in the shape of balanced bushy
trees with the minimum height possible. HadoopRDF [Husain et al., 2011], H2RDF+ [Papailiou
et al., 2013], RAPID+ [Ravindra et al., 2011], and CliqueSquare [Goasdoué et al., 2015]
are systems that try to achieve the above goal. A join among two or more triple patterns is performed
on a single shared variable. Within a job, one or more joins can be performed as long as
they are on different variables. When the query has only one join variable, a single job suffices
for query evaluation.
In HadoopRDF [Husain et al., 2011], a heuristic is used whereby as many joins as
possible are performed in each job, leading to a query plan with the least number of MapReduce
jobs. Figure 5.3 demonstrates a possible query plan produced by HadoopRDF for the example
query of Figure 4.1. In the first job, the joins on variables ?x and ?z are computed between the
first two and the last two triple patterns, respectively.
Job #1: join on ?x of scan(*type_artist.rdf) and scan(*paints.rdf); join on ?z of scan(*exhibited.rdf) and σ(o=:paris)(scan(*located.rdf)). Job #2: join on ?y of the two intermediate results, then project π(?y, ?z).
Figure 5.3: Bushy query plan in MapReduce as produced by HadoopRDF based on Husain et
al. [2011].
The second job joins the intermediate results
of the first job on variable ?y. The same heuristic is used in H2RDF+ [Papailiou et al., 2013] for
non-selective queries.
A similar approach is followed in RAPID+ [Ravindra et al., 2011], where an intermediate
nested algebra is proposed for maximizing the degree of parallelism during join evaluation and
reducing the number of MapReduce jobs. This is achieved by interpreting star joins (triple patterns that share
the same variable on the subject) as groups of triples and defining new operators on these triple
groups. Queries with k star-shaped subqueries are translated into a MapReduce flow with k
MapReduce jobs. The proposed algebra is integrated with Pig Latin [Olston et al., 2008].
In Huang et al. [2011], complex queries, which cannot be evaluated completely by the underlying
RDF store due to the partitioning scheme, are evaluated within MapReduce by joining
the results from different partitions. A query is decomposed into subqueries that can be evaluated
independently at every partition, and then the intermediary results are joined through MapReduce
jobs sequentially. This creates left-deep query plans with the independent subqueries as
leaves. The number of MapReduce jobs increases with the number of subqueries. Figure 5.4
shows how the example query of Figure 4.1 is evaluated based on a partitioned store providing
the 1-hop guarantee shown in Figure 3.3. The query is decomposed into two subqueries; the first
one contains the first three triple patterns, while the second one contains only the last triple pattern.
The results from the two subqueries are joined in a MapReduce job.
In CliqueSquare [Goasdoué et al., 2015], each level of a plan is executed as a single
MapReduce job. Thus, CliqueSquare aims at producing plans that are as flat as possible. Its optimization
algorithm is guaranteed to find at least one of the flattest plans (as discussed below).
These plans contain n-ary star equi-joins at all levels of a plan.
Figure 5.4: The query of Figure 4.1 decomposed into two subqueries, SELECT ?y ?z WHERE { ?x :type :artist . ?x :paints ?y . ?y :exhibited ?z . } and SELECT ?z WHERE { ?z :located :paris . }, whose results are joined on ?z and projected on (?y, ?z) in a single MapReduce job (job #1).
In WARP [Hose and Schenkel, 2013a] and RAPID+ [Ravindra et al., 2011], the authors propose to use bushy trees whose
leaf level allows n-ary joins (n ≥ 2), while the intermediate levels only use binary joins.
Figure 5.5: A BGP query, its variable graph, and sample cliques as defined in CliqueSquare
based on Goasdoué et al. [2015].
A clique of the variable graph is a set of nodes connected by edges carrying the same variable
label; in other words, a clique corresponds to a set of query triples which share one or several
common variables. The blue dotted rectangles in Figure 5.5 show sample cliques in the variable
graph. CliqueSquare's plan enumeration algorithm is as follows:
1. Given a query (thus, a variable graph), a decomposition of the graph is a set of cliques which
cover all the nodes of the variable graph. From the variable graph, a first decomposition is
chosen (there are many ways to do so, as we explain below).
2. Each clique from the chosen decomposition which consists of more than one triple pattern
is transformed into an n-ary join of all its triple patterns. The join result features all the
query variables that appeared in one of its inputs.
3. The variable graph is then reduced (transformed into another, smaller graph) having one node for each clique of the chosen decomposition, and an edge between two nodes whenever the corresponding partial results share a variable.
4. The above process is repeated from step 1 until the variable graph has exactly one node. At
this point, all the query triples have been joined, which corresponds to the root of a query
plan.
Figure 5.6: Clique decomposition and clique reduction over two iterations for the variable graph of Figure 5.5 (cliques C1, C2, C3 over tp1-tp5), and the resulting query plan.
Figure 5.6 illustrates the clique decomposition-reduction steps for the variable graph of
Figure 5.5. In the first iteration the clique decomposition outputs two cliques: one for variable
?x and one for variable ?z. In the reduction step of the same iteration, these cliques are reduced
to nodes in the graph joined by edge ?y. Clique C1 denotes the join of tp1 with tp2 , while clique
C2 denotes the join of tp3 with tp4 . In the second iteration, the clique decomposition selects
the only clique possible and the clique reduction reduces the graph to one node and, thus, the
algorithm terminates. The final node denotes the join between the intermediate results of the
two previously identified joins. On the right-hand side, we show the resulting query plan.
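The decompose-reduce loop can be sketched as follows (a simplified, greedy rendering for illustration, not CliqueSquare's actual enumerator, which explores many decompositions): at each iteration, the triple patterns or partial results sharing the most frequent variable are fused into one n-ary join node, until a single node (the plan root) remains.

    def plan(triple_patterns):
        # Each node carries the set of variables it binds plus a label (pattern or join).
        nodes = [(frozenset(v for v in tp if v.startswith("?")), tp) for tp in triple_patterns]
        while len(nodes) > 1:
            # Greedy decomposition: pick the variable shared by the most nodes.
            counts = {}
            for vars_, _ in nodes:
                for v in vars_:
                    counts[v] = counts.get(v, 0) + 1
            var = max(counts, key=counts.get)
            if counts[var] < 2:                 # disconnected patterns would need a cross product
                raise ValueError("query graph is not connected")
            clique = [n for n in nodes if var in n[0]]
            rest = [n for n in nodes if var not in n[0]]
            # Reduction: the clique becomes one node joining all its members on `var`.
            joined_vars = frozenset().union(*(n[0] for n in clique))
            nodes = rest + [(joined_vars, ("JOIN", var, [n[1] for n in clique]))]
        return nodes[0][1]

    q = [("?x", ":type", ":artist"), ("?x", ":paints", "?y"),
         ("?y", ":exhibited", "?z"), ("?z", ":located", ":paris")]
    print(plan(q))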
It can be easily shown that any logical plan using triple pattern scans and n-ary joins results
from a certain set of clique decompositions successively chosen in step 1 above. Moreover, clique
decompositions can be classified as follows, according to three orthogonal criteria.
• A decomposition may rely on maximal cliques only, or it may consider both partial and
maximal cliques. A clique is maximal in a variable graph if no triple pattern could be
added to it while still respecting the clique definition; partial cliques are those which are
not maximal.
• A decomposition may be a partition, when each triple pattern appears in exactly one clique,
or not, i.e., a triple pattern may appear in several cliques.
• A decomposition may be of minimum size, i.e., cover the variable graph with the smallest possible number of cliques, or not.
The above three orthogonal choices lead to a total of eight search spaces for logical plans
based on n-ary joins. Four of them (those where all decompositions are variable graph partitions)
comprise tree plans, while the other four lead to DAGs. The above algorithm can generate any of
these spaces by making the appropriate decisions along the three dimensions mentioned above.
The authors consider the most interesting logical plan space to be the one obtained by using
(i) maximal or partial cliques, which (ii) may or may not form a partition of the variable graph,
and (iii) forming minimum-size decompositions of the variable graph(s).
5.3 SUMMARY
Table 5.1 summarizes the different optimization methods of the RDF systems by showing the
types of joins they use, the type of query trees they support, and the query planning algorithm
they use.
In the experimental study of Abdelaziz et al. [2017a], it is shown that most systems are
optimized for specific datasets and query types. For example, among the MapReduce-based
systems, the optimization technique of CliqueSquare works well for complex queries where
Table 5.1: Query optimization comparison
its flat plans significantly reduce the overhead of distributed joins, while, for selective queries,
H2RDF avoids the overhead of MapReduce by executing them locally.
CHAPTER 6
RDFS Reasoning in the Cloud
There are three main methods for supporting RDFS reasoning:
• RDFS closure computation: precompute all the entailed triples and store them together with the explicit ones;
• query reformulation: reformulate a given query to take into account entailed triples; and
• a hybrid of the two, combining precomputation with query-time reformulation.
The first method requires computing the entire RDFS closure prior to query processing,
while the second (reformulation) is executed at query time. Finally, in the hybrid approach,
some entailed data is computed statically and some reformulation is done at query time. An
experimental comparison of these RDFS reasoning methods can be found for a centralized
setting in Goasdoué et al. [2013] and for a distributed one in Kaoudi and Koubarakis [2013].
We classify the cloud-based systems that support RDFS reasoning according to these
three categories. We also consider parallel/distributed approaches that were not necessarily in-
tended for the cloud but can be easily deployed therein. The main challenge faced by these
systems is to be complete (answer queries by taking into account both the explicit and the im-
plicit triples) even though the data is distributed. At the same time, the total volume of shuffled
data should be minimized to not degrade performance.
6.1 REASONING THROUGH RDFS CLOSURE COMPUTATION
The parallel processing paradigm of MapReduce is well suited for computing the RDFS
closure.
One of the first works providing RDFS closure computation algorithms in a parallel en-
vironment is WebPie [Urbani et al., 2009]. RDF data is stored in a distributed file system and
the RDFS entailment rules of Table 2.2 are used for precomputing the RDFS closure through
MapReduce jobs.
First, observe in Table 2.2 (page 9) that the rules having two triples in the body imply a
join between the two triples because they have a common value. See, for example, rule s1 where
the object of the first triple should be the same as the subject of the second one. By selecting
the appropriate triple attributes as the output key of the map task, the triples having a common
element will meet at the same reducer. Then, at the reducer, the rule can be applied to generate a
new triple, thus allowing inference to be parallelized. Figure 6.1 illustrates the application of rule i2
from Table 2.2 within a MapReduce job. In the map phase, the triples are read and a key-value
pair is output. The key is the subject or object of the triple, depending on its type, and the value
is the triple itself. All the triples generated with the same key meet at the same reducer where
the new triple is produced.
Second, entailed triples can also be used as input in the rules. For instance, in the exam-
ple of Figure 2.2, the entailed triple (:picasso, :type, :painter) can be used to infer the triple
(:picasso, :type, :artist). Thus, to compute the RDFS closure, repeated execution of MapRe-
duce jobs is needed until a fixpoint is reached, that is, no new triples are generated.
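The following sketch illustrates this style of computation on one rule (a toy simulation of the map, shuffle, and reduce phases, using the subclass rule (s :type c1), (c1 :sc c2) ⊢ (s :type c2) rather than WebPie's exact rule set), together with the outer loop that iterates jobs until a fixpoint is reached.

    from collections import defaultdict

    triples = [
        (":picasso", ":type", ":cubist"),
        (":cubist", ":sc", ":painter"),
        (":painter", ":sc", ":artist"),
    ]

    def map_phase(ts):
        for s, p, o in ts:
            if p == ":type":
                yield o, ("instance", s)       # key = object (the class of the instance)
            elif p == ":sc":
                yield s, ("schema", o)         # key = subject (the subclass)

    def reduce_phase(grouped):
        for cls, values in grouped.items():
            instances = [v for tag, v in values if tag == "instance"]
            supers = [v for tag, v in values if tag == "schema"]
            for s in instances:
                for sup in supers:
                    yield (s, ":type", sup)    # newly entailed triple

    def closure(ts):
        ts = set(ts)
        while True:                            # iterate jobs until a fixpoint is reached
            grouped = defaultdict(list)
            for k, v in map_phase(ts):
                grouped[k].append(v)
            new = set(reduce_phase(grouped)) - ts
            if not new:
                return ts
            ts |= new

    print(sorted(closure(triples) - set(triples)))
    # [(':picasso', ':type', ':artist'), (':picasso', ':type', ':painter')]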
In WebPie [Urbani et al., 2009, 2012], three optimization techniques are proposed to
achieve a fixpoint as soon as possible. The first one starts from the observation that, in each RDFS
rule with a two-triple body, one of the two triples is always a schema triple. Since RDF schemas are
usually much smaller than RDF datasets, the authors propose to replicate the schema at each
node and keep it in memory. Then, each rule can be applied directly either in the map phase or
in the reduce phase of a job, given that the schema is available at each node.
The second optimization consists of applying rules in the reduce phase to take advantage
of triple grouping and thus avoid redundant data generation. If the rules were applied in the
map phase, many redundant triples could be generated. For example, Figure 6.2(a) shows that
the map-side application of rule i3 produces the same triple three times.
Figure 6.2: Map-side and reduce-side application of rule i3 of Table 2.2 when schema triples
are kept in memory.
In contrast, Figure 6.2(b) demonstrates how rule i3 can be applied in the reduce phase without causing any
redundancy.
Finally, in Urbani et al. [2009, 2012], the authors propose an application order for the RDFS
rules based on their interdependencies, so that the required number of MapReduce cycles is
minimized. For example, rule i3 depends on rule i1: output triples of i1 are input triples of i3.
Therefore, it is more efficient to first apply rule i1 and then i3. With such an ordering, the authors show that one can
process each rule only once and obtain the RDFS closure with the minimum number of MapReduce
jobs.
At the same time as Urbani et al. [2009], the authors of Weaver and Hendler [2009]
presented a similar method for computing the RDFS closure based on the complete set of entailment
rules of W3C [2014a], in a parallel way using a message passing interface (MPI). In Weaver
and Hendler [2009], they show that the full set of RDFS rules of W3C [2014a] has certain
properties that allow for an embarrassingly parallel algorithm, meaning that the interdependencies
between the rules can easily be handled by ordering them appropriately. This means that
the RDFS reasoning task can be divided into completely independent tasks that can be executed
in parallel by separate processes. Similarly to Urbani et al. [2009], each process has access to
all schema triples, while data triples are split equally among the processes, and reasoning takes
place in parallel.
The authors of Urbani et al. [2009] have extended WebPie in Urbani et al. [2010] to
enable closure computation based on the OWL Horst rules [ter Horst, 2005a]. Finally, a
proposal similar to the above works appears later in Cichlid [Gu et al., 2015], with
the difference that the authors implement their rule engine in Spark instead of MapReduce and
can thus achieve better performance.
Existing proposals from the literature combine the above reasoning approaches, that is, they
precompute entailed triples for some part of the RDF data, while reformulation may still be
performed at query time.
A common technique in this area is to precompute the RDFS closure of the RDF schema
so that query reformulation can be made faster. This works well because the RDF schema is usu-
ally very small compared to the data, it seldom changes, and it is always used for the RDFS rea-
soning process. This approach is followed in Crainiceanu et al. [2012] and Urbani et al. [2011a].
Rya [Crainiceanu et al., 2012] computes the entailed triples of the RDF schema in
MapReduce after loading the RDF data into the key-value store, where the RDFS closure is
also stored. One MapReduce cycle is used for each level of the subclass hierarchy.
In QueryPie [Urbani et al., 2011a], the authors focus on the parallel evaluation of single
triple-pattern queries according to OWL Horst entailment rules [ter Horst, 2005a], which is a
superset of the RDFS entailment rules. They build and-or trees where the or level is used for the
rules and the and level is used for the rules’ antecedents. The root of the tree is the query triple
pattern. To improve performance, entailed schema triples are precomputed so that the and-or
tree can be pruned. The system is built on top of Ibis [Bal et al., 2010], a framework which
facilitates the development of parallel distributed applications.
Another hybrid approach, introduced in Kaoudi and Koubarakis [2013] for structured
overlay networks and deployable in a cloud, is the use of the magic sets rule-rewriting
algorithm [Bancilhon et al., 1986]. The basic idea is that, given a query, rules are rewritten
using information from the query so that the precomputation of entailed triples generates only
the triples required by the query. The benefit of using the new rules in the bottom-up evalua-
tion is that it focuses only on data which is associated with the query and hence no unnecessary
information is generated. Such a technique is particularly helpful in application scenarios where
knowledge about the query workload is available, and therefore only the triples needed by the
workload are precomputed and stored.
Table 6.1: Comparison of reasoning techniques
6.4 SUMMARY
Table 6.1 summarizes the works that either focus on RDFS reasoning or provide support for it.
The table spells out the reasoning method implemented in each system, the underlying frame-
work on which reasoning takes place, the fragment of entailment rules supported, and the type
of queries supported for query answering (if applicable). The works Urbani et al. [2009, 2010],
Weaver and Hendler [2009], which focus on RDFS closure computation, do not consider query
answering; one could deploy in conjunction with any of them any of the query processing algo-
rithms presented in Chapter 4.
CHAPTER 7
Concluding Remarks
RDF has been successfully used to encode heterogeneous semi-structured data in a variety of
application contexts. Efficiently processing large volumes of RDF data exceeds the capabilities
of a single centralized system. In this context, recent research has sought to take advantage of the
large-scale parallelization possibilities provided by the cloud, while also enjoying features such
as automatic scale-up and scale-down of resource allocation, as well as some level of resilience to
system failures.
In this book we have studied the state of the art in RDF data management in a cloud en-
vironment as well as the most recent advances of RDF data management in parallel/distributed
architectures that were not necessarily intended for the cloud, but can easily be deployed therein.
We described existing systems and proposals which can handle large volumes of RDF data by
classifying them along different dimensions and highlighting their benefits and limitations. The
four dimensions we investigated are data storage, query processing, query optimization, and rea-
soning. These are the fundamental aspects one should first consider when handling RDF data.
There are other aspects of RDF data management, such as data profiling [Kruse et al., 2016]
and statistical learning [Nickel et al., 2016], which we did not focus on in this book.
Overall, we observed a large number of systems using MapReduce-style frameworks and
DFSs, as well as key-value systems with local clients for performing more complex tasks, presumably
because such underlying infrastructures are very easy to use. Among all possibilities for
building an RDF store, each option has its trade-offs. For instance, key-value stores offer a fine-granular
indexing mechanism allowing very fast triple pattern matching; however, they do not
rival the parallelism offered by MapReduce-style frameworks for processing queries efficiently,
and, thus, most systems perform joins locally at a single site. Although this approach may be
efficient for very selective queries with few intermediate results, it is not scalable for analytical-
style queries which need to access big portions of RDF data. For the latter, a MapReduce-style
framework is more appropriate, especially if a fully parallel query plan is employed. Approaches
based on centralized RDF stores are well suited for star-join queries, since triples sharing the
same subject are typically grouped on the same site. Thus, the centralized RDF engine available
on that site can be leveraged to process efficiently the query for the respective data subset; over-
all, such queries are efficiently evaluated by the set of single-site engines working in parallel. In
contrast, path queries which need to traverse the subsets of the RDF graph stored at distinct
sites involve more communications between machines and thus their evaluation is less efficient.
Finally, it may be worth noting that a parallel processor built out of a set of single-site ones
leaves open issues such as fault tolerance and load balancing, issues which are implicitly handled
by frameworks such as MapReduce. Considering the variety of requirements (point queries vs.
large analytical ones, star vs. chain queries, updates, etc.), a combination of techniques, perhaps
with some adaptive flavor taking into account the characteristics of a particular RDF data set
and workload, is likely to lead to the best performance overall.
We currently find numerous open problems as the research area of parallel RDF data man-
agement has only been around for a few years. RDFS reasoning is an essential functionality of
the RDF data model and it needs to be taken into account for RDF stores to provide correct and
complete query answers. While some works investigated the parallelization of the RDFS clo-
sure computation, the area of query reformulation is so far unexplored in a parallel environment
for conjunctive RDF queries. Query reformulation can benefit from techniques for multi-query
optimization like the ones proposed in Elghandour and Aboulnaga [2012], Wang and Chan
[2013], and scan sharing [Kim et al., 2012], but adapting such techniques for RDF data is not
straightforward.
More recently, we have noticed a great amount of work studying optimization techniques such
as query decomposition and join ordering. Query optimization is vital for achieving good system
performance. However, these works are still in their infancy, and most of them neglect the RDFS
reasoning that may happen during query processing and thus affect the optimization
algorithm.
In addition, current works focus only on the conjunctive fragment of SPARQL. Although
this is the first step toward RDF query processing, SPARQL allows for much more expressive
queries, e.g., queries including optional clauses, aggregations, and property paths.1 New frameworks
for RDF analytics have appeared [Colazzo et al., 2014], which may naturally be
adapted to a large-scale (cloud) context. Evaluating such queries in a parallel environment is
still an open issue.
1 http://www.w3.org/TR/sparql11-property-paths/
Bibliography
D. J. Abadi, A. Marcus, S. Madden, and K. J. Hollenbach. Scalable semantic web data
management using vertical partitioning. In C. Koch, J. Gehrke, M. N. Garofalakis, D.
Srivastava, K. Aberer, A. Deshpande, D. Florescu, C. Y. Chan, V. Ganti, C. Kanne, W.
Klas, and E. J. Neuhold, Eds., Proc. of the 33rd International Conference on Very Large Data
Bases, pp. 411–422, ACM, University of Vienna, Austria, September 23–27, 2007. http://www.vldb.org/conf/2007/papers/research/p411-abadi.pdf 26, 37
D. J. Abadi, A. Marcus, S. Madden, and K. Hollenbach. SW-Store: A vertically partitioned
DBMS for semantic web data management. VLDB Journal, 18(2):385–406, 2009. DOI:
10.1007/s00778-008-0125-y 2
I. Abdelaziz, R. Harbi, S. Salihoglu, P. Kalnis, and N. Mamoulis. Spartex: A vertex-centric
framework for RDF data analytics. PVLDB, 8(12):1880–1891, 2015. http://www.vldb.org/pvldb/vol8/p1880-abdelaziz.pdf DOI: 10.14778/2824032.2824091 51
I. Abdelaziz, R. Harbi, Z. Khayyat, and P. Kalnis. A survey and experimental
comparison of distributed SPARQL engines for very large RDF data. PVLDB,
10(13):2049–2060, 2017a. http://www.vldb.org/pvldb/vol10/p2049-abdelaziz.pdf
DOI: 10.14778/3151106.3151109 42, 54, 55, 65
I. Abdelaziz, R. Harbi, S. Salihoglu, and P. Kalnis. Combining vertex-centric graph process-
ing with SPARQL for large-scale RDF data analytics. IEEE Transactions on Parallel and
Distributed Systems, 28(12):3374–3388, December 2017b. DOI: 10.1109/tpds.2017.2720174
51, 52
S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison-Wesley, 1995. http://www-cse.ucsd.edu/users/vianu/book.html 5
S. Abiteboul, I. Manolescu, N. Polyzotis, N. Preda, and C. Sun. XML processing in DHT
networks. In G. Alonso, J. A. Blakeley, and A. L. P. Chen, Eds., Proc. of the 24th International
Conference on Data Engineering, ICDE, pp. 606–615, IEEE, Cancún, México, April 7–12,
2008. DOI: 10.1109/ICDE.2008.4497469 31
A. Abouzeid, K. Bajda-Pawlikowski, D. J. Abadi, A. Rasin, and A. Silberschatz. HadoopDB:
An architectural hybrid of MapReduce and DBMS technologies for analytical workloads.
PVLDB, 2(1):922–933, 2009. http://www.vldb.org/pvldb/2/vldb09-861.pdf DOI:
10.14778/1687627.1687731 37
Accumulo. Apache Accumulo. http://accumulo.apache.org/, 2008. 14, 30
P. Adjiman, F. Goasdoué, and M. Rousset. SomeRDFS in the semantic web. Journal of Data
Semantics, 8:158–181, 2007. DOI: 10.1007/978-3-540-70664-9_6 12
S. Alexaki, V. Christophides, G. Karvounarakis, D. Plexousakis, K. Tolle, B. Amann, I. Fundu-
laki, M. Scholl, and A. Vercoustre. Managing RDF metadata for community webs. In S. W.
Liddle, H. C. Mayr, and B. Thalheim, Eds., Conceptual Modeling for E-Business and the Web,
ER Workshops on Conceptual Modeling Approaches for E-Business and the World Wide Web and
Conceptual Modeling, Salt Lake City, UT, October 9–12, 2000, Proceedings, volume 1921 of
Lecture Notes in Computer Science, pp. 140–151, Springer, 2000. DOI: 10.1007/3-540-45394-
6_13 38
S. Alexaki, V. Christophides, G. Karvounarakis, D. Plexousakis, and K. Tolle. The ICS-FORTH RDFSuite: Managing voluminous RDF description bases. In SemWeb, 2001. http://CEUR-WS.org/Vol-40/alexakietal.pdf 38
D. J. Beckett. The design and implementation of the Redland RDF application framework. In
V. Y. Shen, N. Saito, M. R. Lyu, and M. E. Zurko, Eds., Proc. of the 10th International World
Wide Web Conference, WWW 10, pp. 449–456, ACM, Hong Kong, China, May 1–5, 2001.
DOI: 10.1145/371920.372099. 38
T. Berners-Lee, J. Hendler, and O. Lassila. The semantic web. Scientific American, 284(5):34–43,
May 2001. DOI: 10.1038/scientificamerican0501-34 1
J. Broekstra and A. Kampman. Sesame: A generic architecture for storing and querying RDF
and RDF schema. In ISWC, 2002. DOI: 10.1002/0470858060.ch5 2
J. Broekstra, A. Kampman, and F. van Harmelen. Sesame: A generic architecture for storing
and querying RDF and RDF schema. In I. Horrocks and J. A. Hendler, Eds., The Semantic
Web—ISWC, 1st International Semantic Web Conference, Proceedings, Sardinia, Italy, June 9–
12, 2002, volume 2342 of Lecture Notes in Computer Science, pp. 54–68, Springer, 2002. DOI:
10.1007/3-540-48005-6_7 38
F. Bugiotti, F. Goasdoué, Z. Kaoudi, and I. Manolescu. RDF data management in the Amazon
cloud. In D. Srivastava and I. Ari, Eds., Proc. of the Joint EDBT/ICDT Workshops, pp. 61–72,
ACM, Berlin, Germany, March 30, 2012. DOI: 10.1145/2320765.2320790 39, 40, 45, 54,
55
L. He, B. Shao, Y. Li, H. Xia, Y. Xiao, E. Chen, and L. J. Chen. Stylus: A strongly-
typed store for serving massive RDF data. PVLDB, 11(2):203–216, 2017. DOI:
10.14778/3149193.3149200 33, 40, 52, 53, 54, 55
K. Hose and R. Schenkel. WARP: Workload-aware replication and partitioning for RDF. In
C. Y. Chan, J. Lu, K. Nørvåg, and E. Tanin, Eds., Workshops Proceedings of the 29th IEEE
International Conference on Data Engineering, ICDE, pp. 1–6, Computer Society, Brisbane,
Australia, April 8–12, 2013a. DOI: 10.1109/ICDEW.2013.6547414 34, 37, 40, 42, 54, 55,
60, 66
K. Hose and R. Schenkel. WARP: Workload-aware replication and partitioning for RDF. In
DESWEB, 2013b. DOI: 10.1109/icdew.2013.6547414 47, 51
J. Huang, D. J. Abadi, and K. Ren. Scalable SPARQL querying of large RDF graphs. PVLDB,
4(11):1123–1134, 2011. http://www.vldb.org/pvldb/vol4/p1123-huang.pdf 18, 34,
35, 36, 37, 40, 47, 53, 54, 55, 60, 61
Z. Kaoudi and M. Koubarakis. Distributed RDFS reasoning over structured overlay networks.
Journal on Data Semantics, 2013. DOI: 10.1007/s13740-013-0018-0 2, 67, 71
Z. Kaoudi and I. Manolescu. RDF in the clouds: A survey. VLDB Journal, 24(1):67–91, 2015.
DOI: 10.1007/s00778-014-0364-z 18
Z. Kaoudi, I. Miliaraki, and M. Koubarakis. RDFS reasoning and query answering on top
of DHTs. In A. P. Sheth, S. Staab, M. Dean, M. Paolucci, D. Maynard, T. W. Finin, and
K. Thirunarayan, Eds., The Semantic Web—ISWC, 7th International Semantic Web Conference,
ISWC, Karlsruhe, Germany, October 26–30, Proceedings, volume 5318 of Lecture Notes in
Computer Science, pp. 499–516, Springer, 2008. DOI: 10.1007/978-3-540-88564-1_32 12
Z. Kaoudi, K. Kyzirakos, and M. Koubarakis. SPARQL query optimization on top of DHTs.
In P. F. Patel-Schneider, Y. Pan, P. Hitzler, P. Mika, L. Zhang, J. Z. Pan, I. Horrocks, and B.
Glimm, Eds., The Semantic Web—ISWC—9th International Semantic Web Conference, ISWC,
Shanghai, China, November 7–11, Revised Selected Papers, Part I, volume 6496 of Lecture
Notes in Computer Science, pp. 418–435, Springer, 2010. DOI: 10.1007/978-3-642-17746-
0_27 2
G. Karypis and V. Kumar. A fast and high quality multilevel scheme for partition-
ing irregular graphs. SIAM Journal on Scientific Computing, 20(1):359–392, 1998. DOI:
10.1137/s1064827595287997 24, 34, 38
H. Kim, P. Ravindra, and K. Anyanwu. From SPARQL to MapReduce: The journey using a
nested triplegroup algebra. PVLDB, 4(12):1426–1429, 2011. http://www.vldb.org/pvldb/vol4/p1426-kim.pdf 18, 40
H. Kim, P. Ravindra, and K. Anyanwu. Scan-sharing for optimizing RDF graph pat-
tern matching on MapReduce. In IEEE Conference on Cloud Computing, 2012. DOI:
10.1109/cloud.2012.14 74
M. König, M. Leclère, M. Mugnier, and M. Thomazo. Sound, complete and minimal UCQ-
rewriting for existential rules. Semantic Web, 6(5):451–475, 2015. DOI: 10.3233/SW-140153
12
G. Ladwig and A. Harth. CumulusRDF: Linked data management on nested key-value stores.
In The 7th International Workshop on Scalable Semantic Web Knowledge Base Systems (SSWS),
p. 30, 2011. 30, 31, 40, 48, 54, 55
K. Lee and L. Liu. Scaling queries over big RDF graphs with semantic hash partitioning.
PVLDB, 6(14):1894–1905, 2013. http://www.vldb.org/pvldb/vol6/p1894-lee.pdf
DOI: 10.14778/2556549.2556571 18, 34, 36, 40, 42
F. Li, B. C. Ooi, M. T. Özsu, and S. Wu. Distributed data management using MapReduce.
ACM Computing Surveys, 46(3):31, 2014. DOI: 10.1145/2503009 18, 50, 64
B. McBride. Jena: A semantic web toolkit. IEEE Internet Computing, 6(6):55–59, 2002. http://doi.ieeecomputersociety.org/10.1109/MIC.2002.1067737 38
G. Moerkotte and T. Neumann. Analysis of two existing and one new dynamic program-
ming algorithm for the generation of optimal bushy join trees without cross products. In
Proc. of the 32nd International Conference on Very Large Data Bases, VLDB’06, pp. 930–941,
VLDB Endowment, 2006. http://dl.acm.org/citation.cfm?id=1182635.1164207
DOI: 10.5555/1182635.1164207 65
G. Moerkotte and T. Neumann. Dynamic programming strikes back. In SIGMOD, pp. 539–
552, 2008. DOI: 10.1145/1376616.1376672 65
S. Muñoz, J. Pérez, and C. Gutierrez. Simple and efficient minimal RDFS. Journal of Web Semantics,
7(3):220–234, September 2009. DOI: 10.1016/j.websem.2009.07.003 9
T. Neumann and G. Moerkotte. Characteristic sets: Accurate cardinality estimation for RDF
queries with multiple joins. In Proc. of the 27th International Conference on Data Engineering,
ICDE, pp. 984–994, IEEE Computer Society, Hannover, Germany, April 11–16, 2011. DOI:
10.1109/ICDE.2011.5767868 33
T. Neumann and G. Weikum. The RDF-3X engine for scalable management of RDF data.
VLDBJ, 19(1), 2010a. DOI: 10.1007/s00778-009-0165-y 29, 36, 47
T. Neumann and G. Weikum. The RDF-3X engine for scalable management of RDF data.
VLDBJ, 19(1), 2010b. DOI: 10.1007/s00778-009-0165-y 2
M. Nickel, K. Murphy, V. Tresp, and E. Gabrilovich. A review of relational machine learning for
knowledge graphs. Proc. of the IEEE, 104(1):11–33, 2016. DOI: 10.1109/jproc.2015.2483592
73
C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: A not-so-foreign
language for data processing. In Proc. of the ACM SIGMOD International Conference on Man-
agement of Data, SIGMOD, pp. 1099–1110, Vancouver, BC, Canada, June 10–12, 2008. DOI:
10.1145/1376616.1376726 48, 60
A. Owens, A. Seaborne, and N. Gibbins. Clustered TDB: A clustered triple store for Jena,
2008. http://eprints.ecs.soton.ac.uk/16974/ 2
N. Papailiou, I. Konstantinou, D. Tsoumakos, and N. Koziris. H2RDF: Adaptive query pro-
cessing on RDF data in the cloud. In A. Mille, F. L. Gandon, J. Misselis, M. Rabinovich, and
S. Staab, Eds., Proc. of the 21st World Wide Web Conference, WWW, pp. 397–400, ACM, Lyon,
France, April 16–20, 2012 (Companion Volume). DOI: 10.1145/2187980.2188058
18, 29, 30, 31, 40, 52, 54, 55, 66
N. Papailiou, I. Konstantinou, D. Tsoumakos, P. Karras, and N. Koziris. H2RDF+: High-
performance distributed joins over large-scale RDF graphs. In X. Hu, T. Y. Lin, V. Raghavan,
B. W. Wah, R. A. Baeza-Yates, G. Fox, C. Shahabi, M. Smith, Q. Yang, R. Ghani, W. Fan,
R. Lempel, and R. Nambiar, Eds., Proc. of the IEEE International Conference on Big Data,
pp. 255–263, October 6–9, Santa Clara, CA, 2013. DOI: 10.1109/BigData.2013.6691582
18, 29, 30, 31, 40, 49, 50, 59, 60, 65
N. Papailiou, D. Tsoumakos, I. Konstantinou, P. Karras, and N. Koziris. H2RDF+: An efficient
data management system for big RDF graphs. In C. E. Dyreson, F. Li, and M. T. Özsu, Eds.,
International Conference on Management of Data, SIGMOD, pp. 909–912, ACM, Snowbird,
UT, June 22–27, 2014. DOI: 10.1145/2588555.2594535 18
N. Papailiou, D. Tsoumakos, P. Karras, and N. Koziris. Graph-aware, workload-adaptive
SPARQL query caching. In Proc. of the ACM SIGMOD International Conference on Man-
agement of Data, SIGMOD’15, pp. 1777–1792, 2015. DOI: 10.1145/2723372.2723714 65,
66
P. Peng, L. Zou, L. Chen, and D. Zhao. Query workload-based RDF graph fragmentation
and allocation. In Proc. of the 19th International Conference on Extending Database Technology,
EDBT, pp. 377–388, Bordeaux, France, March 15–16, OpenProceedings.org, 2016a. DOI:
10.5441/002/edbt.2016.35 37
P. Peng, L. Zou, M. T. Özsu, L. Chen, and D. Zhao. Processing SPARQL queries over dis-
tributed RDF graphs. VLDB Journal, 25(2):243–268, 2016b. DOI: 10.1007/s00778-015-
0415-0 37, 40, 52, 53
M. Przyjaciel-Zablocki, A. Schätzle, T. Hornung, C. Dorner, and G. Lausen. Cascading map-
side joins over HBase for scalable join processing. CoRR, abs/1206.6293, 2012. http://ar
xiv.org/abs/1206.6293 31, 40, 50, 54, 55
R. Punnoose, A. Crainiceanu, and D. Rapp. Rya: A scalable RDF triple store for the clouds.
In J. Darmont and T. B. Pedersen, Eds., 1st International Workshop on Cloud Intelligence
(colocated with VLDB), Cloud-I’12, p. 4, ACM, Istanbul, Turkey, August 31, 2012. DOI:
10.1145/2347673.2347677 30, 31, 40
G. Raschia, M. Theobald, and I. Manolescu, Eds. Proc. of the 1st International Workshop on Open Data
(WOD), 2012. 1
P. Ravindra, H. Kim, and K. Anyanwu. An intermediate algebra for optimizing RDF graph
pattern matching on MapReduce. In G. Antoniou, M. Grobelnik, E. P. B. Simperl, B.
Parsia, D. Plexousakis, P. D. Leenheer, and J. Z. Pan, Eds., The Semantic Web: Research and
Applications—8th Extended Semantic Web Conference, ESWC, Heraklion, Crete, Greece, May
29–June 2, 2011, Proceedings, Part II, volume 6644 of Lecture Notes in Computer Science,
pp. 46–61, Springer, 2011. DOI: 10.1007/978-3-642-21064-8_4 18, 26, 40, 44, 54, 55,
59, 60, 61, 66
S. Salihoglu and J. Widom. GPS: A graph processing system. In Proc. of the 25th International
Conference on Scientific and Statistical Database Management, SSDBM, pp. 22:1–22:12, 2013.
DOI: 10.1145/2484838.2484843 51
W3C. Resource description framework (RDF): Concepts and abstract syntax. http://www.
w3.org/TR/2004/REC-rdf-concepts-20040210/, 2004. 1
W3C OWL Working Group. OWL 2 Web Ontology Language Document Overview (Second
Edition). W3C Recommendation, https://www.w3.org/TR/owl2-overview/, 2012. 1
J. Weaver and J. A. Hendler. Parallel materialization of the finite RDFS closure for hundreds
of millions of triples. In A. Bernstein, D. R. Karger, T. Heath, L. Feigenbaum, D. Maynard,
E. Motta, and K. Thirunarayan, Eds., The Semantic Web—ISWC, 8th International Semantic
Web Conference, Proceedings, Chantilly, VA, October 25–29, volume 5823 of Lecture Notes in
Computer Science, pp. 682–697, Springer, 2009. DOI: 10.1007/978-3-642-04930-9_43 69,
72
C. Weiss, P. Karras, and A. Bernstein. Hexastore: Sextuple indexing for semantic web data man-
agement. PVLDB, 1(1):1008–1019, 2008a. http://www.vldb.org/pvldb/1/1453965.pd
f DOI: 10.14778/1453856.1453965 29
C. Weiss, P. Karras, and A. Bernstein. Hexastore: Sextuple indexing for semantic web data
management. PVLDB, 1(1):1008–1019, 2008b. DOI: 10.14778/1453856.1453965 2
K. Wilkinson, C. Sayers, H. A. Kuno, and D. Reynolds. Efficient RDF storage and retrieval in
Jena2. In SWDB (in conjunction with VLDB), 2003. 2
B. Wu, H. Jin, and P. Yuan. Scalable SPARQL querying processing on large RDF data in cloud
computing environment. In ICPCA/SWS, 2012. DOI: 10.1007/978-3-642-37015-1_55 37
B. Wu, Y. Zhou, P. Yuan, L. Liu, and H. Jin. Scalable SPARQL querying using path parti-
tioning. In J. Gehrke, W. Lehner, K. Shim, S. K. Cha, and G. M. Lohman, Eds., 31st IEEE
International Conference on Data Engineering, ICDE, pp. 795–806, IEEE Computer Society, Seoul,
South Korea, April 13–17, 2015. DOI: 10.1109/ICDE.2015.7113334 18, 34, 36, 40
B. Wu, Y. Zhou, H. Jin, and A. Deshpande. Parallel SPARQL query optimization. In 33rd
IEEE International Conference on Data Engineering, ICDE, pp. 547–558, IEEE Computer Society,
San Diego, CA, April 19–22, 2017. DOI: 10.1109/ICDE.2017.110 64, 66
M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. Spark: Cluster
computing with working sets. In E. M. Nahum and D. Xu, Eds., 2nd USENIX Work-
shop on Hot Topics in Cloud Computing, HotCloud’10, USENIX Association, Boston, MA,
June 22, 2010. https://www.usenix.org/conference/hotcloud-10/spark-cluster-
computing-working-sets 18
K. Zeng, J. Yang, H. Wang, B. Shao, and Z. Wang. A distributed graph engine for web scale
RDF data. PVLDB, 6(4):265–276, 2013. http://www.vldb.org/pvldb/vol6/p265-zeng
.pdf DOI: 10.14778/2535570.2488333 51, 54, 55, 65
X. Zhang, L. Chen, and M. Wang. Towards efficient join processing over large RDF graph
using MapReduce. In SSDBM, 2012. DOI: 10.1007/978-3-642-31235-9_16 26, 40, 44, 50,
54, 55
X. Zhang, L. Chen, Y. Tong, and M. Wang. EAGRE: Towards scalable I/O efficient SPARQL
query evaluation on the cloud. In ICDE, 2013. DOI: 10.1109/icde.2013.6544856 28, 40,
45, 54, 55
L. Zou, J. Mo, L. Chen, M. T. Özsu, and D. Zhao. gStore: Answering SPARQL queries via
subgraph matching. PVLDB, 4(8):482–493, 2011. http://www.vldb.org/pvldb/vol4/
p482-zou.pdf DOI: 10.14778/2002974.2002976 37
Authors’ Biographies
ZOI KAOUDI
Zoi Kaoudi is a Senior Researcher in the DIMA group at the Technische Universität
Berlin (TUB). She previously worked as a Scientist at the Qatar Computing Research Institute
(QCRI) of Hamad Bin Khalifa University in Qatar, as a research associate at the IMIS-Athena
Research Center, and as a postdoctoral researcher at Inria. She received her Ph.D. from the
National and Kapodistrian University of Athens in 2011. Her research interests include cross-
platform data processing, machine learning systems, and distributed RDF query processing and
reasoning. Recently, she has been the proceedings chair of EDBT 2019, co-chaired the TKDE
poster track co-located with ICDE 2018, and co-organized MLDAS 2019, held in Qatar.
She has co-authored articles in both the database and Semantic Web communities and has served
on the program committees of several international database conferences.
IOANA MANOLESCU
Ioana Manolescu is a senior Inria researcher and the lead of the CEDAR team (joint between
Inria Saclay and the LIX lab of École polytechnique) in France. The CEDAR team's research
focuses on rich data analytics at cloud scale. Ioana is a member of the PVLDB Endowment
Board of Trustees and has served for four years (including as president) on the ACM SIGMOD
Jim Gray Ph.D. Dissertation Award committee. Recently, she has been a general chair of the IEEE
ICDE 2018 conference, an associate editor for PVLDB 2017 and 2018, and the program chair
of SSDBM 2016. She has co-authored more than 130 articles in international journals and
conferences and contributed to several books. Her main research interests include data models
and algorithms for computational fact-checking, performance optimizations for semistructured
data and the Semantic Web, and distributed architectures for complex, large-scale data.
STAMATIS ZAMPETAKIS
Stamatis Zampetakis is an R&D engineer at TIBCO Orchestra Networks and a PMC member
of Apache Calcite. Previously, he was a postdoctoral researcher at Inria, from where he also
received his Ph.D. in 2015. Before that, he worked at FORTH-ICS as a research assistant.
His research interests are in the broad area of query optimization, with an emphasis on RDF query
processing and visualization.