Abstract: A flexible platform supporting the linked data life-cycle has been developed and applied in various use cases in the context of the large-scale linked
open data project Fusepool P3. Besides a description of the aims and achievements, experiences from publishing and reusing linked data in the public sector and
in business are summarized. It is highlighted that, without further help, it is difficult
for domain experts to estimate the time, effort and necessary skills when trying to
transfer the platform to other use cases. Applying a new publishing methodology
turned out to be useful in these cases.
Keywords: linked data, semantic enrichment, linked data life-cycle, data publishing, data integration, resource description framework, data management
1. Introduction
The exploitation of the Internet for intelligent knowledge management has been
worked on for many years and it still remains one of the main challenges for the scientific community with added value for business, public bodies and civil society. In this
attempt, the web is not only used in a classical way for publishing (unstructured) documents as HTML pages, offering online services like shopping, booking or text-based
search engines, but also as a platform for processing and managing structured information. It appears in the form of data, which is published, interlinked and integrated
with other structured information as linked data [1], that can subsequently be browsed
or queried.
Annotated with appropriate vocabulary terms from ontologies, this interlinked
structured information can not only be searched by keywords, but on a semantic level,
thus laying the foundation for the Semantic Web [2]. Through linked data, information
and services on the Internet and in web-based applications and mobile apps can and
have already been enriched in a sophisticated way, although in the broad public it is not
yet noticed as a big bang, since it comes in the form of a quiet revolution [3]. Facebook's
Knowledge Graph, Google's Hummingbird and Bing's Satori are examples of improving services through semantic search technologies, revealing the revolution's silence
through incrementally improving the services in small iterations while constantly digesting information from different sources.
In the e-Government domain, the use of linked open data (LOD) is spreading, as
public authorities realize its benefits not only regarding the transparency of governmental processes, but also as a driver for economic innovation: the availability of machine-readable semantically enriched open data enables SMEs and other entities to
develop and provide new value-added services and applications. However, while public
authorities in democratic countries around the globe have already developed or are developing a
strategy for Open Government Data (OGD), only a fraction of those already take the
additional step of provisioning the data as LOD through SPARQL endpoints. Take for
example Switzerland: An e-Government strategy is in place both on the federal level
(since 2007) and in most cantons; in addition, an OGD portal2 serving as the single point of entry for
all OGD data in Switzerland was established in February 2016. A service platform
for LOD however is only available in a pilot stage with currently only a limited set of
data.3 One of the main roadblocks hindering a wider adoption of linked open data is
that authorities shy away from the additional effort needed to convert OGD to LOD.
This was also one of the key motivators to start the Fusepool P3 project.
Meanwhile, the Linked Data paradigm has fostered and propelled the emergence
of numerous research projects and software products with focus on LOD [4]. Currently,
the most prominent output of the LOD movement is visualized in the LOD cloud,4 the
core of which is formed by the data sets of DBpedia [5] and GeoNames.5 Moreover,
many domain-specific applications have evolved [6], often with an exploratory focus.
Inherent to LOD applications is the processing of data analogous to ETL processing in the data warehouse domain, but with more complex operations such as data
extraction, enrichment, interlinking, fusing and maintenance. While these can be automated to a certain degree for a specific domain, a lot of manual work is still necessary,
e.g., for mapping tasks. This data processing is part of the linked data life-cycle [7],
which occurs with different complexity, depending among others on the data sources and
the requirements of the target applications. In one way or another, the linked data life-cycle is integral to research projects like LOD2 [8], LATC [9], GeoKnow [10] and
Fusepool [11].
In this paper we describe experiences from Fusepool P3 [12], a large-scale EC-funded FP7 project with a focus on publishing and reusing linked data. The research
goal was to develop enhanced products and services based on the exploitation of linked
data in the context of the tourism domain. In the next section, the project goals are
summarized, followed by a description of the architecture of the integrated data platform. Next, experiences from the project are pointed out, before concluding with remarks on the transfer of the research results to other application contexts.
http://opendata.swiss/
http://lindas-data.ch/
4
http://lod-cloud.net/
5
http://www.geonames.org/
3
101
Among the main findings, we have learned that the Fusepool platform can significantly
simplify the publishing of data as linked open data. Regional authorities in Trento and
Tuscany were thus enabled to provide tourism-related data that forms the basis for novel
applications. Reflecting on several completed use cases showed that additional advice and
recommendations are essential for transferring project results to other use cases. A new
publishing methodology, described below, allows for recording information on completed LOD projects and helps in estimating and planning new LOD use cases.
Figure 1: Elements of the Fusepool P3 data value chain; Fusepool derives its name from the idea of fusing
and pooling linked data with analytical processing on top of it, and P3 abbreviates Linked Data Publish-Process-Perform
http://fusepool.eu/
2.1 Architecture
We aim at providing a single platform for the linked data life-cycle. To achieve this,
the Fusepool platform architecture is based on loosely coupled components communicating via HTTP and exposing RESTful APIs exchanging RDF [14]. This leads to reusability of components, enables distributed development and makes it easier for developers to understand and extend the software, thus ensuring its longevity.
RESTful RDF is the platform's native interaction method, meaning that there are
no proprietary data access APIs in place. Platform components, as well as third party
applications, communicate using generic RDF APIs. In Fig. 2, the Fusepool platform
architecture is depicted.
Figure 2: The Fusepool platform architecture. Client applications and the Fusepool P3 Dashboard access the platform via REST, SPARQL and LDP 1.0; an LDP Transforming Proxy mediates between clients, transformers (Transformer API, Transformer Registry, Transformer Factory Registry, pipeline and single transformers) and backends (SPARQL endpoint, LDP 1.0 server, custom services), complemented by the User Interaction Request API and registry.
The diagram shows how the Fusepool P3 dashboard (the main user interface for interacting with the platform) and other clients access the Fusepool platform primarily via
an LDP Transforming Proxy, an extension of the LDP 1.0 specification which uses the
REST-based Transforming Container API to enable RDF data generation and annotation from input data. The proxy transparently handles transformation processes by calling the actual transformers in the background and, once the process has finished, sends
the data back to the LDP Server. Clients can also access transformers directly
via their REST API (the Transformer API) or use a SPARQL 1.1 endpoint.
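As an illustration of this interaction style, the following sketch builds (without sending) the kind of HTTP request a client would issue against a transformer; the endpoint URL, payload and media types are assumptions for this example, not fixed platform values.

```python
import urllib.request

# Hypothetical transformer endpoint; a deployed platform exposes its own URLs.
endpoint = "http://localhost:8080/transformers/xslt"

# Source document to be transformed (a toy XML payload for illustration).
payload = b"<events><event id='1'>Concert</event></events>"

# A client POSTs the source document and asks for RDF (Turtle) in return;
# the request is only constructed here, not sent.
req = urllib.request.Request(
    endpoint,
    data=payload,
    headers={"Content-Type": "application/xml", "Accept": "text/turtle"},
    method="POST",
)
```

Because all interaction is plain HTTP exchanging RDF, any HTTP client library can play the role of a platform component or third-party application.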
As a result, the architecture does not require a common runtime for its components.
Every component, including all transformers, is by default run as an individual process
acting via HTTP as the interaction interface. The exceptions to this are the backend-related components (LDP, SPARQL, the RDF Triple Store and possible custom
backend services) which may be more tightly coupled, i.e., they may be run in the same
former associated with it, should multiple transformers be executed over the POSTed
data.
The Transforming Container API is defined as an extension to the LDP specification to allow special containers to execute a transformer when a member resource is
added via a POST request. This allows documents to be automatically transformed
when they are added to an LDPC, with both the original data and the transformed
data available as resources inside the Transforming LDPC. This process is supported via the
LDP Transforming Proxy.
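The container behaviour can be sketched with a minimal in-memory model; the class and member names are made up for illustration, and the real platform works on HTTP resources rather than Python objects:

```python
# In-memory sketch of a Transforming LDP Container: POSTing a member resource
# stores the original document and, next to it, the transformer's output.
class TransformingContainer:
    def __init__(self, transformer):
        self.transformer = transformer  # callable mapping a document to RDF
        self.resources = {}             # member name -> stored document

    def post(self, name, document):
        # Keep the original data as one member resource ...
        self.resources[name] = document
        # ... and the transformed data as a sibling resource.
        self.resources[name + "-transformed"] = self.transformer(document)

# Toy "transformer" standing in for an actual XSLT or enrichment transformer.
container = TransformingContainer(lambda doc: doc.upper())
container.post("event1", "concert in trento")
```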
The User Interaction Request API describes how an LDPC is used to maintain a
registry of requests for interaction. Its purpose is to provide support for components
which require user interaction during their lifetime, such as transformers requesting
user input. According to the API, components submit a URI to the mentioned registry
and remove the URI once the interaction is completed. A UI component can then provide the user with a link to the submitted URI. The component is free to present any
web application at the denoted URI suitable for performing the required interaction.
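A minimal sketch of such a registry follows; the names are illustrative, and in the platform the registry is itself an LDPC holding the submitted URIs:

```python
# Components submit a URI when they need user input and remove it once the
# interaction is finished; a UI component lists the currently open requests.
class InteractionRegistry:
    def __init__(self):
        self._open = set()

    def submit(self, uri):
        self._open.add(uri)

    def complete(self, uri):
        self._open.discard(uri)

    def open_requests(self):
        return sorted(self._open)

registry = InteractionRegistry()
registry.submit("http://example.org/transformer/42/confirm-mapping")
open_before = registry.open_requests()
registry.complete("http://example.org/transformer/42/confirm-mapping")
open_after = registry.open_requests()
```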
Backends. The platform can use both Apache Marmotta and Virtuoso Universal Server
as backends, which provide the generic LDP and SPARQL interfaces and data persistence in an RDF Triple Store. However, based on the architectural approach, any other
tool which supports the LDP and/or the SPARQL standards can be used as the platform
backend as well.
3. Experiences
Our experiences with the Fusepool platform are best explained by the example of our
two initial stakeholders in the Fusepool P3 project, namely two touristic regions in
Italy: Provincia Autonoma di Trento (PAT) and Regione Toscana (RET). They have
been publishing open data and supporting the development of applications and
services in the tourism domain for some time. During this time both partners gained
valuable experience in data creation, maintenance and publication.
3.1. Limitations in Publishing Open Data
PAT and RET first started publishing data sets which were considered strategic. In
Italy in general, but also in the two regions Tuscany and Trentino, one of the most important businesses is tourism, including the industrial activities linked to and built
around it. Thus the regions are struggling with one particular question: How can
they support and push tourism by changing their daily operations?
Both partners provide a CKAN-based open data portal,7 which provides data publishers with tools to make data findable and usable. The data quality depends on the data provider: apart from the addition of some meta information, the data that gets pushed into the system
is made available to the user unchanged.
7 http://ckan.org/
At project start, open data from PAT and RET was only available in particular data
formats like CSV, KML, XML and JSON. App developers had to download the raw
data and process it using their own ETL processes. With every update of the raw data,
this process had to be triggered manually for every single application using this data. If
the format of the raw data changed, the process had to be adjusted and could not be
automated. With every new data source, the maintenance complexity of these open data
sets and the apps consuming them increased.
3.2. Linked Data Life-Cycle
Reducing the complexity for consuming open data requires that the necessary ETL
work is done up-front, ideally by the data owner or someone with domain knowledge.
Furthermore, the data should preferably be published as a service and without the need
for running separate database servers and other services. This is where linked data and
its RDF technology stack come into play. With its open, non-proprietary data model
and open standards such as SPARQL and HTTP, RDF serves as a lingua franca, expressed using well-known schemas and ontologies.
In the classic document-centric web, not much is known about the relationship between two pages, as links between them are untyped. RDF links far more granular entities than pages, e.g., single attributes of an object, and defines relations between data
items in schemas and ontologies. Best practices recommend publishing these schemas
and ontologies also as RDF, thus making them publicly available in a machine-readable
form.
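For illustration, a typed link between such granular entities might look as follows in Turtle; the resource URIs and the schema.org terms are chosen purely for this example:

```turtle
@prefix schema: <http://schema.org/> .

<http://example.org/event/42>
    a schema:Event ;
    schema:name "Concert in Trento" ;
    schema:location <http://example.org/poi/7> .
```

Here the link to the POI is typed (schema:location), so a consumer knows the meaning of the relation without inspecting either resource.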
3.3. Applying the Linked Data Life-Cycle
Experience with applying the linked data life-cycle using the Fusepool platform was
gained in preparation for and during a hackathon at the Spaghetti Open Data Event,8
where the initial versions of two linked open data applications based on data from the
Province of Trento were developed.
The first one, a web application called LOD events eXplorer, allows events in
the Trento region to be browsed and also shows information and pictures of nearby points of interest (POIs) (see Fig. 3). The developers could easily transform the
original data set, provided as an XML feed, into RDF using the XSLT transformer provided by the Fusepool platform and store the results in the data store of the platform,
making it accessible through SPARQL queries.
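A query against the platform's SPARQL endpoint might then look as follows; the schema.org modelling is an assumption for illustration, and the vocabulary actually used for the Trento data may differ:

```sparql
PREFIX schema: <http://schema.org/>

# List ten events with name and start date, ordered chronologically.
SELECT ?event ?name ?start
WHERE {
  ?event a schema:Event ;
         schema:name ?name ;
         schema:startDate ?start .
}
ORDER BY ?start
LIMIT 10
```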
The most time-consuming manual task in doing so was to develop the XSLT file
that defines the mapping from the XML elements to the appropriate RDF model; creating the mapping required developer skills and took a few hours, including familiarization with the tool and the environment setup. The subsequent transformation of the data itself, however, took only a matter of seconds. RDFizing and
interlinking other data, such as nearby POIs and images from DBpedia, turned out to be
an easy and less complex task compared to the development of the initial mapping.
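The kind of element-to-property mapping such an XSLT file encodes can be sketched as follows, here in Python rather than XSLT, with a made-up feed structure and schema.org terms; the real feed and mapping differ:

```python
import xml.etree.ElementTree as ET

# Made-up excerpt of an XML event feed, only illustrating the mapping's shape.
feed = """<events>
  <event id="42"><title>Concert</title><town>Trento</town></event>
</events>"""

def event_to_turtle(elem):
    # Map one <event> element to RDF triples in Turtle syntax.
    uri = f"<http://example.org/event/{elem.get('id')}>"
    return (f"{uri} a <http://schema.org/Event> ;\n"
            f'    <http://schema.org/name> "{elem.findtext("title")}" ;\n'
            f'    <http://schema.org/location> "{elem.findtext("town")}" .')

turtle = "\n".join(event_to_turtle(e)
                   for e in ET.fromstring(feed).iter("event"))
```

Once such a mapping is written, it applies mechanically to every future update of the feed, which is why the one-off mapping effort dominates the overall cost.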
http://www.spaghettiopendata.org/
106
Figure 3: The LOD events eXplorer application, showing events in the Trento region
Table 1: LIDAPUME template for the Swiss Archive Use Case (1D=1 effort-day)
Using the methodology and the template turned out to be a good starting point for
LOD use case planning with regard to the completeness of the planning, the necessary project skills and the project duration. Having experience from completed projects at hand
allows for better estimation and shortens the learning curve.
The LIDAPUME methodology and template have been validated for several use
cases which are described in more detail in [16]. Besides the above-described use cases,
it has been applied in enhancing the FU Berlin library content through an LOD use case,
called Library Keyword Clustering, and in the Swiss Archive Use Case [17].
9 http://www.bar.admin.ch/
10 http://www.dnb.de/DE/Standardisierung/GND/gnd_node.html
will be developed in the future. By providing Docker images,11 the Fusepool platform
can be deployed within an organization in a few hours.
To have a sustainable linked data ecosystem, more work is still necessary on the
user interface level. In a follow-up project, it is thus planned to work with data publishers to simplify the dashboard UI and to add wizard-style tool guidance: for example,
when the user selects an XML-based data set in a CKAN site that they want to publish
as linked data, the wizard will suggest using the XSLT transformer. The user still has
the option to choose another transformer like BatchRefine (which adds batch processing capabilities to OpenRefine), but the wizard limits the possible user selections
to transformers that can take an XML file as input.
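The wizard's pre-selection can be sketched as a simple filter over transformer capabilities; the transformer names and accepted media types below are illustrative, not the platform's actual registry:

```python
# Accepted input media types per transformer (illustrative values only).
TRANSFORMERS = {
    "XSLT transformer": {"application/xml", "text/xml"},
    "BatchRefine":      {"text/csv", "application/xml"},
    "RDF passthrough":  {"text/turtle"},
}

def suggest_transformers(media_type):
    """Return the transformers able to consume the given input type."""
    return sorted(name for name, accepted in TRANSFORMERS.items()
                  if media_type in accepted)

suggestions = suggest_transformers("application/xml")
```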
In addition, it is planned to develop a cookbook that gives non-technical users
step-by-step instructions, including screencasts, on how to use the platform. It will be
based on three typical user scenarios, considering first data and subsequently technical
components:
1. Based on a concrete data set in a CKAN site. The cookbook explains the steps and the usage of additional tools that may be needed, e.g., how to create an OpenRefine configuration in order to publish data from a CSV-based format.
2. Based on a concrete data file. This is very similar to the first scenario, the difference being that the file is not retrieved from a CKAN site but available on a local drive.
3. Based on a known data structure and some sample data.
These changes and additions will hopefully simplify and improve the platform, allowing data publishers to use it without further help, hence significantly simplifying the
task of publishing data as linked open data.
Acknowledgement: The research leading to these results has received funding
from the European Union Seventh Framework Programme (FP7/2007-2013) under
grant agreement no. 609696.
11 http://docker.com/

References
[1] T. Heath and C. Bizer, "Linked Data: Evolving the Web into a Global Data Space," Synthesis Lectures on the Semantic Web: Theory and Technology, vol. 1, no. 1, pp. 1–136, 2011.
[2] T. Berners-Lee, J. Hendler, and O. Lassila, "The Semantic Web," Sci. Am., vol. 284, no. 5, pp. 35–43, 2001.
[3] W. Hall, "Linked Data: The Quiet Revolution," ERCIM News, vol. 96, p. 4, 2014.
[4] F. Bauer and M. Kaltenböck, Linked Open Data: The Essentials. edition mono, Vienna, Austria, 2012.
[5] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, and S. Hellmann, "DBpedia - A crystallization point for the Web of Data," J. Web Semant., vol. 7, no. 3, pp. 154–165, 2009.
[6] ERCIM, "ERCIM News 96," 2014. [Online]. Available: http://ercim-news.ercim.eu/en96. [Accessed: 22-Oct-2015].
[7] S. Auer, J. Lehmann, A. C. Ngonga Ngomo, and A. Zaveri, "Introduction to linked data and its lifecycle on the web," LNAI, vol. 8067, pp. 1–90, 2013.
[8]
[9]
[10]
[11]
[12]
[13]
[14]
[15]
[16]
[17]
[18]
[19]