Querying the Linked Data Graph Using owl:sameAs Provenance
Sam Coppens, Ruben Verborgh, Erik Mannens, and Rik Van de Walle
Abstract. Querying the Linked Open Data cloud as a whole remains problematic. Prior knowledge is required to federate queries to the appropriate datasets: each dataset provides its own SPARQL endpoint to query that dataset, and each dataset uses its own vocabulary to describe its information. In this paper, we propose a federated and asynchronous SPARQL framework to query the global data space. Our query federation is based on the fact that owl:sameAs relationships are symmetric and transitive and, hence, that such linked resources are interchangeable. The provenance of the owl:sameAs links generated during the enrichment process is processed to resolve the SPARQL endpoints of linked datasets and their vocabulary mappings. This information is used during the query federation. Asynchronous SPARQL processing makes the query federation more scalable and exploits web caching better.
Keywords: Distributed, Linked Open Data, sameAs, SPARQL
1 Introduction
The main goal of Linked Open Data is to create a global data space, the Linked
Open Data cloud1 [1], where end users can easily discover and consume data.
Crucial for Linked Open Data is link generation: it discovers relationships between the resources being published and the resources already described in the Linked Open Data cloud, using the SPARQL protocol [7]. This step links the data being published with information from other data sources, turning the disparate set of data sources into one global data space.
Bizer et al. [2] explain very clearly the concept and technical principles of Linked
Data and why link discovery between datasets is crucial in creating a global data
space.
At the moment, querying this global data space as a whole remains a challenge. If we want to get results from a query that ranges over multiple data sources, accessible through different SPARQL endpoints, we either have to split up the query into subqueries for each SPARQL endpoint and combine the results, or we have to first fetch all the different data sources into one data source and then query this aggregated data source. Both approaches have some limitations, though:
1 http://linkeddata.org/
– In the first approach, we will have to know which part of the query is meant
for which SPARQL endpoint and we will need to know for each SPARQL
endpoint which vocabulary it uses, so that appropriate mappings to the
query can be applied.
– In the second approach, we don't need to know which SPARQL endpoints must be queried to receive an appropriate answer, but we still need to know which vocabulary to use in the query so that the aggregated data sources can understand it. This approach carries the risk that, in a highly interlinked environment, a lot of information has to be indexed first.
2 http://jena.apache.org/documentation/query/
3 http://jena.apache.org/
3 Solution
The main idea behind our distribution framework is to rewrite incoming queries in such a way that the symmetry and transitivity of owl:sameAs is exploited. To achieve this, our framework traverses the owl:sameAs links of the resources using SPARQL. Incoming queries are split up into subqueries, which can be evaluated independently. For each subquery, we look for possible owl:sameAs-linked resources. These subqueries are refactored to also target the remote SPARQL endpoints of these owl:sameAs-linked resources. At the same time, we apply vocabulary mappings to the refactored subqueries to match the remote datasets' vocabularies. Thus, to distribute the queries, some prior knowledge is needed:
– SPARQL endpoints: for each discovered related resource, we have to know
the SPARQL endpoint we can consult for retrieving information on the re-
lated resource.
– Vocabulary mappings: we cannot just distribute the same query to the different SPARQL endpoints, because every endpoint uses its own vocabulary to describe things; thus we need appropriate mappings for the disparate classes and properties of the remote data sources. Some datasets use, e.g., rdfs:label for denoting the name of a person, while other datasets use, e.g., foaf:name. The latter datasets need to be queried using foaf:name instead of rdfs:label to retrieve the person's name.
This prior knowledge can be retrieved from the provenance of the owl:sameAs links. The provenance of these owl:sameAs links can be expressed as rules in the form of SPARQL construct queries, which will be explained in detail in Section 4. This allows us to build a lookup table, called the distribution index, which stores the service endpoints and, for each service endpoint, the property mappings and the class mappings. This information can be extracted from the SPARQL construct queries representing the owl:sameAs links' provenance. This index building process takes place prior to the queries, not during them; it is a preprocessing step. This lookup table is part of the distributed SPARQL processor, which uses this information to answer its incoming queries and to distribute the subqueries accordingly.
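As a sketch, the distribution index can be thought of as a lookup table keyed by endpoint URL. The Python below is illustrative only (the class and field names are ours, not the framework's actual API) and is filled once, as a preprocessing step, with mappings like those derived from the rules in Section 4:

```python
from dataclasses import dataclass, field

@dataclass
class EndpointEntry:
    """What the distribution index stores per remote SPARQL endpoint:
    pairs of (local pattern, remote pattern)."""
    property_mappings: list = field(default_factory=list)
    class_mappings: list = field(default_factory=list)

class DistributionIndex:
    """Lookup table from SPARQL endpoint URL to its vocabulary mappings,
    built before any query is fired (a preprocessing step)."""
    def __init__(self):
        self.entries = {}

    def add_endpoint(self, endpoint_url):
        return self.entries.setdefault(endpoint_url, EndpointEntry())

    def endpoints(self):
        return list(self.entries)

index = DistributionIndex()
entry = index.add_endpoint("http://dbpedia.org/sparql")
entry.property_mappings.append((
    "?resource lons:name ?concept . ?concept skos:prefLabel ?name",
    "?remoteresource foaf:name ?name"))
entry.class_mappings.append((
    '?resource a owl:Thing . ?resource lons:type "person"',
    "?remoteresource a foaf:Person"))
```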
Our distribution framework is implemented as an extension to ARQ. It has
two main building blocks. The index builder processes the owl:sameAs links’
provenance to build up the distribution index. This is explained in Section 4. The
query distributor is the second main building block, responsible for federating
and mapping the incoming queries, using the distribution index. This block is
explained in Section 5.
4 Index Builder
The Index Builder is responsible for building the distribution index. This index must be built before any queries are fired. The index builder takes the provenance information of the owl:sameAs links, extracts the services to which it will distribute incoming queries, and for each service extracts the possible mappings.
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX lons: <http://foo.org/localontologynamespace#>
CONSTRUCT {?resource owl:sameAs ?remoteresource}
WHERE {
?resource a owl:Thing.
?resource lons:type "person".
?resource lons:name ?concept.
?concept skos:prefLabel ?name.
SERVICE <http://dbpedia.org/sparql>
{
?remoteresource a foaf:Person.
?remoteresource foaf:name ?remotename.
FILTER (str(?remotename) = ?name)
}
}
Fig. 1. The provenance of an owl:sameAs link, relating two persons, represented as a SPARQL construct query
Many owl:sameAs links are generated based on some rule, which represents the provenance of an owl:sameAs link. Figure 1 shows how an owl:sameAs link, relating two persons, can be represented as a SPARQL construct query. The index builder processes such rules to feed the distribution index. Our framework implements two instances of this index builder as an extension to ARQ. There are, thus, two ways of feeding the distribution index:
– A SPARQL index builder that is fed with SPARQL construct queries, rep-
resenting the provenance of the owl:sameAs links present in the dataset.
In fact, these rules (SPARQL construct queries) are the queries used for
interlinking the dataset. The SPARQL queries are compiled into SPARQL
algebra and the index builder algorithm, described below, is implemented
such that it operates on SPARQL algebraic expressions.
– A SILK index builder that takes as input a configuration file of SILK [6], the link discovery framework4. This configuration file has all the information available to represent the provenance of the discovered links as SPARQL construct queries. Hence, an index builder was also implemented for this sort of input. This SILK index builder just parses the SILK configuration file and directly fills the index, because the SPARQL endpoint, class mappings, and property mappings for this SPARQL endpoint are directly available from the configuration file.
4 http://www4.wiwiss.fu-berlin.de/bizer/silk/
4.1 Index Builder Algorithm
As explained, the provenance information of all the owl:sameAs links of a dataset
can be expressed as a number of SPARQL queries. The index builder iterates
through all these queries and for each query it does the following:
1. Extract the remote SPARQL endpoint and store it in the distribution index. In our example, this is <http://dbpedia.org/sparql>.
2. Extract the query variables from the CONSTRUCT clause, expressing the
local resource (e.g., ?resource) and remote resource (e.g., ?remoteresource).
3. Extract the triple pattern for the local resource and the triple pattern for
the remote resource.
4. Extract possible FILTER expressions for the local resource and the possible
FILTER expressions for the remote resource.
5. Extract all different paths from both the local and remote triple patterns.
These paths can include FILTER expressions, thus for building the paths
also the extracted FILTER expressions are taken into account.
6. For every path extracted from the remote triple pattern and FILTER expressions that ends in a query variable, find the corresponding path, extracted from the local triple pattern and FILTER expressions, that ends in the same query variable. The part of the remote path from the remote query variable to the shared query variable, and the part of the local path starting from the local query variable, are property mappings of each other, as shown in the example below. This mapping is stored in the distribution index for this SPARQL endpoint.
Property mapping 1:
?resource lons:name ?concept. ?concept skos:prefLabel ?name =
?remoteresource foaf:name ?remotename.
FILTER (str(?remotename) = ?name)
7. The remaining paths from the remote and local triple patterns and FILTER expressions make up the class mapping. This is also stored in the distribution index for this SPARQL endpoint.
Class mapping 1:
?resource a owl:Thing. ?resource lons:type "person" =
?remoteresource a foaf:Person.
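Steps 5 to 7 can be sketched in Python on triples represented as (subject, predicate, object) tuples. For brevity, the sketch folds the FILTER equality from the example into a directly shared variable ?name, and all function names are our own:

```python
def build_mappings(local_triples, remote_triples):
    """Pair local and remote query paths that end in the same variable
    (property mappings); the leftover triples form the class mapping."""
    def paths_by_end_variable(triples, start):
        # Follow chains of triples from the start variable; record a path
        # when it ends in a query variable other than the start.
        paths = {}
        def walk(node, acc):
            extended = False
            for t in triples:
                if t[0] == node and t not in acc:
                    extended = True
                    walk(t[2], acc + [t])
            if not extended and node.startswith('?') and node != start:
                paths[node] = acc
        walk(start, [])
        return paths

    local = paths_by_end_variable(local_triples, '?resource')
    remote = paths_by_end_variable(remote_triples, '?remoteresource')
    property_mappings = [(local[v], remote[v]) for v in local if v in remote]
    mapped = {t for lp, rp in property_mappings for t in lp + rp}
    class_mapping = ([t for t in local_triples if t not in mapped],
                     [t for t in remote_triples if t not in mapped])
    return property_mappings, class_mapping

# The rule from the running example, with the FILTER folded into ?name:
local = [('?resource', 'a', 'owl:Thing'),
         ('?resource', 'lons:type', '"person"'),
         ('?resource', 'lons:name', '?concept'),
         ('?concept', 'skos:prefLabel', '?name')]
remote = [('?remoteresource', 'a', 'foaf:Person'),
          ('?remoteresource', 'foaf:name', '?name')]
props, classes = build_mappings(local, remote)
```

Running this reproduces the property mapping and class mapping listed above: the lons:name/skos:prefLabel chain pairs with foaf:name, and the remaining typing triples pair up as the class mapping.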
To summarise, for each rule the index builder stores in the distribution index the remote SPARQL endpoint together with its property mappings and class mappings.
5 Query Distributor
The query distributor refactors the incoming query into a distributed query using the information of the distribution index built by our index builder. The query is transformed in such a way that all resources linked via owl:sameAs become interchangeable, which aggregates all the information that is directly or indirectly available on these resources. Thus, our framework actually exploits the fact that owl:sameAs is symmetric and transitive.
This distribution of the incoming query is done during query optimisation in ARQ, thus after the incoming query has been compiled to a SPARQL algebraic expression. The query distributor is implemented as a query transformation that uses the distribution index to perform its transformations. It searches for triple patterns that are not part of a SPARQL SERVICE element. Triple patterns belonging to a SPARQL SERVICE operator are not altered; only the part of the incoming query that is meant to be evaluated against the local data is affected by the query distributor. The query distribution algorithm distributes incoming queries in two phases. First, the basic graph patterns (BGPs) of the query are distributed during the Transform BGP phase. Afterwards, these distributed BGPs are merged appropriately during the Merge BGP phase. The details of the query distribution algorithm are discussed hereafter.
1. Extract the triples of the BGP to be evaluated against the local data (i.e.,
triples from triple patterns not part of a SPARQL SERVICE element).
2. Extract possible FILTER expressions affecting these extracted triple pat-
terns.
3. Extract all nodes from the BGP. These nodes can be URIs or query variables
(e.g. ?book, ?booktitle, ?author, and ?comment for BGP 1 from our example
depicted in Figure 2).
4. For every node, extract all possible paths (i.e. query paths) from the ex-
tracted triples and FILTER expressions. Thus, these query paths can include
FILTER expressions, apart from the triples. They can be seen as property
paths, extended with FILTER expressions.
5. Every extracted path from the previous step has a local variant (e.g., ?book dbpediaont:writer ?author) and a remote variant (e.g., ?book owl:sameAs ?DQV0. SERVICE <http://dbpedia.org/sparql> { ?DQV0 dbpediaont:writer ?author }), which are tied together using the UNION operator, as shown in Figure 3. For the remote variant of the query path, the start node of that query path is decoupled into its owl:sameAs equivalent. The owl:sameAs equivalent is distributed to the SPARQL endpoints from the distribution index, and at the same time the mappings from the distribution index for that SPARQL endpoint are applied. The example depicted in Figure 3 shows the query paths extracted from node ?book for the example query depicted in Figure 2.
6. The outcome of the previous step is only the distributed query paths starting from a certain node (e.g., ?book in our example). These blocks (i.e., sequence operators) still need to be tied together. This is done during this last step using the UNION operator. This way, a BGP is split up into all possible query paths. Figure 4 gives a schematic overview of how BGP 1 from the example depicted in Figure 2 is distributed. Each of the three right blocks represents the outcome of the previous step, each using a different starting node.
Distributed query paths extracted with starting node ?book from BGP 1:
(sequence
(union
(bgp (triple ?book dbpediaont:writer ?author)) |Local QP 1
(sequence
(bgp (triple ?book owl:sameAs ?DQV0)) |Remote QP 1
(service <http://dbpedia.org/sparql> |
(bgp (triple ?DQV0 dbpediaont:writer ?author)))))|
(union
(bgp (triple ?book rdfs:label ?booktitle)) |Local QP 2
(sequence
(bgp (triple ?book owl:sameAs ?DQV1)) |Remote QP 2
(service <http://dbpedia.org/sparql> |
(bgp (triple ?DQV1 rdfs:label ?booktitle))))) |
(union
(bgp (triple ?book rdfs:comment ?comment)) |Local QP 3
(sequence
(bgp (triple ?book owl:sameAs ?DQV2)) |Remote QP 3
(service <http://dbpedia.org/sparql> |
(bgp (triple ?DQV2 rdfs:comment ?comment)))))) |
Fig. 3. Part of the outcome from step 5 of the Transform BGPs algorithm
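The local/remote union of step 5, shown in Figure 3, can be mimicked by a small Python function producing nested tuples that resemble ARQ's algebra printout. This is a sketch only: it omits the vocabulary mappings that the real distributor applies, and the names are ours:

```python
from itertools import count

def distribute_triple(triple, endpoints, counter):
    """Rewrite one triple pattern into the union of its local query path
    and, per remote endpoint, a remote query path over owl:sameAs."""
    s, p, o = triple
    branches = [('bgp', (s, p, o))]                  # local variant
    for endpoint in endpoints:
        dqv = f'?DQV{next(counter)}'                 # fresh query variable
        branches.append(('sequence',
            ('bgp', (s, 'owl:sameAs', dqv)),         # decouple the start node
            ('service', endpoint,
                ('bgp', (dqv, p, o)))))              # evaluate remotely
    return ('union', *branches)

counter = count()
expr = distribute_triple(('?book', 'dbpediaont:writer', '?author'),
                         ['http://dbpedia.org/sparql'], counter)
```

Applied to each triple of BGP 1 in turn, this yields the three union blocks of Figure 3, with ?DQV0, ?DQV1, and ?DQV2 as the fresh decoupling variables.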
The previous phase only distributed the BGPs. These distributed BGPs cannot be merged as such. If we return to the example depicted in Figure 2 and consider BGP 2, the nodes extracted from this BGP are only ?book and ?producer. If the local dataset only has information on authors (which are linked to their DBpedia owl:sameAs equivalents) and we want to retrieve the producer of a book, there needs to be a query path from ?author to ?producer. This query path is extracted neither from BGP 1 nor from BGP 2. For this, we need a special merging algorithm. The result of such a merge is shown in Figure 5. This example only shows the merged operator with the query path starting from node ?book. The result is an extra query path merged in using a LEFTJOIN operator. To achieve this, our merging algorithm works as follows:
1. The triples from BGP 1 and BGP 2 are merged.
2. From the merged triples all nodes are extracted (e.g. ?book, ?booktitle, ?com-
ment, and ?producer).
3. For every node, all query paths are extracted.
4. If a query path from a certain node is not yet present in the distributed BGP 1, the query path is merged with the already extracted query paths for that node in the distributed BGP 1. This merge is done using the operator that binds BGP 1 and BGP 2, i.e., the LeftJoin operator in this case, which is the algebraic equivalent of the OPTIONAL operator in SPARQL, as shown in Figure 5.
Merged (BGP 1 and BGP 2), distributed query paths extracted with starting node ?book:
(leftjoin
(sequence operator shown in step 5 of the transform BGP algorithm)
(union
(bgp (triple ?book dbpediaont:writer ?producer))
(sequence
(bgp (triple ?book owl:sameAs ?DQV0))
(service <http://dbpedia.org/sparql>
(bgp (triple ?DQV0 dbpediaont:writer ?producer))))))
Fig. 5. Merging of distributed BGP 1 and BGP 2 - only that part with query paths
starting with ?book node
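The four merge steps can be sketched as follows, with `distribute` standing in for the step-5 transform (a trivial toy version is used in the example) and all names illustrative:

```python
def merge_distributed_bgps(distributed_bgp1, bgp1_triples, bgp2_triples,
                           distribute, operator='leftjoin'):
    """Merge the triples of both BGPs, find the query paths that the
    distributed BGP 1 does not yet cover, and attach their distributed
    form with the operator binding the BGPs (LeftJoin for OPTIONAL)."""
    merged = bgp1_triples + [t for t in bgp2_triples if t not in bgp1_triples]
    missing = [t for t in merged if t not in bgp1_triples]
    result = distributed_bgp1
    for t in missing:
        result = (operator, result, distribute(t))
    return result

# Toy stand-in for the step-5 transform:
distribute = lambda t: ('union', ('bgp', t))
bgp1 = [('?book', 'rdfs:label', '?booktitle')]
bgp2 = [('?book', 'dbpediaont:writer', '?producer')]
merged = merge_distributed_bgps(('sequence', '...'), bgp1, bgp2, distribute)
```

The result has the shape of Figure 5: the already-distributed BGP 1 as the left operand of the LeftJoin, and the newly distributed ?producer path as the right operand.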
6 Optimisations
Mappings. To the query paths that are being federated, mappings are applied to refactor each query path to the vocabularies the remote dataset supports. This enhances the interoperability of the query federation framework and makes our framework more robust. The mappings are retrieved implicitly, by processing the provenance of the owl:sameAs links. As a consequence, the mappings are always expressed in terms of the local vocabularies used. This can be seen either as a point-to-point mapping strategy or as a strategy using a universal data model, where the universal data model is always the local data model from which the distributions start.
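A single-predicate simplification of applying such a mapping could look like the following; the real mappings can span multi-triple paths and FILTER expressions, and the mapping dictionary below is a hypothetical example:

```python
def apply_vocabulary_mapping(triples, predicate_map):
    """Refactor a query path to the remote dataset's vocabulary by
    substituting each mapped predicate (single-predicate simplification)."""
    return [(s, predicate_map.get(p, p), o) for s, p, o in triples]

# Hypothetical mapping: the local dataset uses rdfs:label where the
# remote dataset uses foaf:name.
remote_path = apply_vocabulary_mapping(
    [('?DQV0', 'rdfs:label', '?name')],
    {'rdfs:label': 'foaf:name'})
```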
7 Evaluation
For evaluating our distribution framework, we make use of the Berlin SPARQL Benchmark (BSBM). The dataset of this benchmark, consisting of 10 000 product descriptions, is published over two SPARQL endpoints, i.e., a local and a remote SPARQL endpoint. We have set up a local SPARQL endpoint whose dataset only contains product URIs, linked via owl:sameAs with the product resources of the remote BSBM dataset. The BSBM querymix is fired at the local SPARQL endpoint, which uses our distribution framework to distribute and solve the queries. For this, it contacts the remote SPARQL endpoint. The execution times of the results are used for evaluation. These result times are evaluated against the result times of the 'perfect' distributed query. This 'perfect' distributed query just forwards the whole query of the BSBM querymix to the remote SPARQL endpoint using the SERVICE operator. This way, we evaluate the result times of a query using our distribution algorithm against the result times of a query using a 'perfect' query distribution. By doing this kind of evaluation, our evaluation is also independent of the hardware/software used and the size of the BSBM dataset, and evaluates the distribution algorithm itself.
The table below shows the results for only a part of the queries of the BSBM
querymix. The other query results showed similar performance.
                                  Result time first result   Result time last result
'perfect' distribution algorithm
Query 1                           780 ms                     3763 ms
Query 3                           2701 ms                    4076 ms
Query 7                           517 ms                     603 ms
Our distribution algorithm
Query 1                           967 ms (+23%)              4480 ms (+19%)
Query 3                           3594 ms (+22%)             4929 ms (+21%)
Query 7                           631 ms (+22%)              862 ms (+42%)
8 Conclusion
9 Acknowledgments
The research activities that have been described in this paper were funded by
Ghent University, Interdisciplinary Institute for Broadband Technology (IBBT),
the Institute for the Promotion of Innovation by Science and Technology in
Flanders (IWT), the Fund for Scientific Research-Flanders (FWO-Flanders),
and the European Union.
References
1. Bizer, C., Heath, T., Idehen, K., Berners-Lee, T.: Linked Data on the Web. In: Proceedings of the 17th International World Wide Web Conference – LDOW Workshop, pp. 1265–1266, Beijing, China, April 2008.
2. Bizer, C., Heath, T., Berners-Lee, T.: Linked Data – The Story So Far. Int. J. Semantic Web Inf. Syst. 5(3), 1–22 (2009).
3. Görlitz, O., Staab, S.: SPLENDID: SPARQL Endpoint Federation Exploiting VoID Descriptions. In: Hartig, O., Harth, A., Sequeda, J. (eds.) COLD, CEUR Workshop Proceedings, vol. 782. CEUR-WS.org (2011).
4. Harth, A., Hose, K., Karnstedt, M., Polleres, A., Sattler, K.-U., Umbrich, J.: Data Summaries for On-Demand Queries over Linked Data. In: Proceedings of the 19th International Conference on World Wide Web (WWW '10), pp. 411–420. ACM (2010).
5. Hartig, O., Bizer, C., Freytag, J.: Executing SPARQL Queries over the Web of Linked Data. In: Proceedings of the 8th International Semantic Web Conference (ISWC), 2009.
6. Jentzsch, A., Isele, R., Bizer, C.: Silk – Generating RDF Links while Publishing or Consuming Linked Data. In: 9th International Semantic Web Conference (ISWC 2010), November 2010.
7. Prud'hommeaux, E., Seaborne, A. (eds.): SPARQL Query Language for RDF. W3C Recommendation. World Wide Web Consortium, November 2007. Available at http://www.w3.org/TR/rdf-sparql-query/.
8. Quilitz, B., Leser, U.: Querying Distributed RDF Data Sources with SPARQL. In: 5th European Semantic Web Conference (ESWC 2008), pp. 524–538, June 2008.
9. Schwarte, A., Haase, P., Hose, K., Schenkel, R., Schmidt, M.: FedX: A Federation Layer for Distributed Query Processing on Linked Open Data. In: ESWC (2), pp. 481–486 (2011).
10. Tummarello, G., Delbru, R., Oren, E.: Sindice.com: Weaving the Open Linked Data. In: Proceedings of the International Semantic Web Conference (ISWC), 2007.