DBpedia: A Nucleus for a Web of Open Data
1 Introduction
It is now almost universally acknowledged that stitching together the world’s
structured information and knowledge to answer semantically rich queries is
one of the key challenges of computer science, and one that is likely to have
tremendous impact on the world as a whole. This has led to almost 30 years
of research into information integration [15,19] and ultimately to the Semantic
Web and related technologies [1,11,13]. Such efforts have generally only gained
traction in relatively small and specialized domains, where a closed ontology,
vocabulary, or schema could be agreed upon. However, the broader Semantic
Web vision has not yet been realized, and one of the biggest challenges facing such
efforts has been how to get enough “interesting” and broadly useful information
into the system to make it useful and accessible to a general audience.
A challenge is that the traditional “top-down” model of designing an ontology
or schema before developing the data breaks down at the scale of the Web: both
data and metadata must constantly evolve, and they must serve many different
communities. Hence, there has been a recent movement to build the Seman-
tic Web grass-roots-style, using incremental and Web 2.0-inspired collaborative
approaches [10,12,13]. Such a collaborative, grass-roots Semantic Web requires
a new model of structured information representation and management: first
and foremost, it must handle inconsistency, ambiguity, uncertainty, data prove-
nance [3,6,8,7], and implicit knowledge in a uniform way.
Perhaps the most effective way of spurring synergistic research along these
directions is to provide a rich corpus of diverse data. This would enable re-
searchers to develop, compare, and evaluate different extraction, reasoning, and
uncertainty management techniques, and to deploy operational systems on the
Web.
The DBpedia project has derived such a data corpus from the Wikipedia
encyclopedia. Wikipedia is heavily visited and under constant revision (e.g.,
according to alexa.com, Wikipedia was the 9th most visited website in the third
quarter of 2007). Wikipedia editions are available in over 250 languages, with
the English one accounting for more than 1.95 million articles. Like many other
web applications, Wikipedia has the problem that its search capabilities are
limited to full-text search, which only allows very limited access to this valuable
knowledge base. As has been highly publicized, Wikipedia also exhibits many
of the challenging properties of collaboratively edited data: it has contradictory
data, inconsistent taxonomical conventions, errors, and even spam.
The DBpedia project focuses on the task of converting Wikipedia content
into structured knowledge, such that Semantic Web techniques can be employed
against it — asking sophisticated queries against Wikipedia, linking it to other
datasets on the Web, or creating new applications or mashups. We make the
following contributions:
The DBpedia datasets can either be imported into third-party applications or accessed online through a variety of DBpedia user interfaces. Figure 1 gives an overview of the DBpedia information extraction process and shows how extracted data is published on the Web. The main DBpedia interfaces currently use Virtuoso [9] and MySQL as storage back-ends.
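For illustration only, the following sketch queries such a Virtuoso-backed store over HTTP. It assumes the public SPARQL endpoint at http://dbpedia.org/sparql and Virtuoso's query/format request parameters; neither the endpoint URL nor the example query is taken from this paper.

# Sketch: ask a Virtuoso-backed SPARQL endpoint for the English abstract of Busan.
# Endpoint URL, request parameters, and query are illustrative assumptions.
import json
import urllib.parse
import urllib.request

ENDPOINT = "http://dbpedia.org/sparql"  # assumed public endpoint

query = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?comment WHERE {
  <http://dbpedia.org/resource/Busan> rdfs:comment ?comment .
  FILTER (lang(?comment) = "en")
}
"""

params = urllib.parse.urlencode({
    "query": query,
    "format": "application/sparql-results+json",
})
with urllib.request.urlopen(ENDPOINT + "?" + params) as response:
    results = json.load(response)

for binding in results["results"]["bindings"]:
    print(binding["comment"]["value"])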
The paper is structured as follows: We give an overview of the DBpedia information extraction techniques in Section 2. The resulting datasets are described in Section 3. We present methods for programmatic access to the DBpedia dataset in Section 4. In Section 5 we present our vision of how the DBpedia datasets can be a nucleus for a Web of open data. We showcase several user interfaces for accessing DBpedia in Section 6 and finally review related work in Section 7.

Figure 1. Overview of the DBpedia information extraction and publishing process: Wikipedia dumps (article texts and DB tables) are loaded into the DBpedia datasets, which are stored in Virtuoso and MySQL and published via Web 2.0 mashups, Semantic Web browsers, and traditional web browsers.
2 Extracting Structured Information from Wikipedia

Wikipedia articles consist mostly of free text, but also contain different types of
structured information, such as infobox templates, categorisation information,
images, geo-coordinates, links to external Web pages and links across different
language editions of Wikipedia.
MediaWiki4 is the software used to run Wikipedia. Due to the nature of this wiki system, essentially all editing, linking, and annotating with metadata is done inside article texts by adding special syntactic constructs. Hence, structured information can be obtained by parsing article texts for these syntactic constructs.
Since MediaWiki exploits some of this information itself for rendering the user
interface, some information is cached in relational database tables. Dumps of the
crucial relational database tables (including the ones containing the article texts)
for different Wikipedia language versions are published on the Web on a regular
basis5. Based on these database dumps, we currently use two different methods of extracting semantic relationships: (1) we map the relationships that are already stored in relational database tables onto RDF, and (2) we extract additional information directly from the article texts and infobox templates within the articles.

4 http://www.mediawiki.org
5 http://download.wikimedia.org/
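To make method (1) concrete, here is a minimal sketch that maps rows of a simplified, hypothetical page-link table onto RDF triples. It is not the project's actual PHP code, and both the table layout and the property namespace are assumptions.

# Sketch of method (1): map rows of a (hypothetical, simplified) relational
# dump table onto RDF triples. The real extraction operates in PHP on the
# actual MediaWiki table layout.
from rdflib import Graph, Namespace

DBPEDIA = Namespace("http://dbpedia.org/resource/")
DBPROP = Namespace("http://dbpedia.org/property/")  # assumed property namespace

# Hypothetical rows from a page-link table: (source article, target article)
pagelink_rows = [
    ("Busan", "South_Korea"),
    ("Busan", "Korean_language"),
]

g = Graph()
for source, target in pagelink_rows:
    g.add((DBPEDIA[source], DBPROP["wikilink"], DBPEDIA[target]))

print(g.serialize(format="turtle"))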
Figure 2. Example of a Wikipedia template and rendered output (excerpt).

We illustrate the extraction of semantics from article texts with a Wikipedia infobox template example. Figure 2 shows the infobox template (encoded within a Wikipedia article) and the rendered output for the South Korean city of Busan.
san. The infobox extraction algorithm detects such templates and recognizes
their structure using pattern matching techniques. It selects significant tem-
plates, which are then parsed and transformed to RDF triples. The algorithm
uses post-processing techniques to increase the quality of the extraction: MediaWiki links are recognized and transformed to suitable URIs; common units are detected and transformed to data types. Furthermore, the algorithm can detect lists of objects, which are transformed to RDF lists. Details about the infobox extraction algorithm (including issues like data type recognition, cleansing heuristics, and identifier generation) can be found in [2]. All extraction algorithms are implemented in PHP and are available under an open-source license6.
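As an illustration only (the actual extraction framework is the PHP code mentioned above), a much-simplified sketch of the infobox idea could look as follows. The property namespace, the template excerpt, and the attribute values are assumptions made for the example.

# Much-simplified sketch of infobox extraction: parse "| key = value" lines of
# an infobox template and emit RDF triples. The real implementation additionally
# handles units, lists, cleansing heuristics, and identifier generation [2].
import re

wikitext = """{{Infobox City
| name       = Busan
| population = 3635389
| country    = [[South Korea]]
}}"""

RESOURCE = "http://dbpedia.org/resource/"
PROPERTY = "http://dbpedia.org/property/"   # assumed property namespace

def infobox_triples(article_title, text):
    triples = []
    for key, value in re.findall(r"^\|\s*(\w+)\s*=\s*(.+)$", text, re.MULTILINE):
        value = value.strip()
        link = re.fullmatch(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]", value)
        if link:  # MediaWiki link -> resource URI
            obj = f"<{RESOURCE}{link.group(1).replace(' ', '_')}>"
        elif value.isdigit():  # very naive data type detection
            obj = f'"{value}"^^<http://www.w3.org/2001/XMLSchema#integer>'
        else:
            obj = f'"{value}"'
        triples.append(f"<{RESOURCE}{article_title}> <{PROPERTY}{key}> {obj} .")
    return triples

print("\n".join(infobox_triples("Busan", wikitext)))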
Linked Data. Linked Data is a method of publishing RDF data on the Web
that relies on http:// URIs as resource identifiers and the HTTP protocol to
retrieve resource descriptions [4,5]. The URIs are configured to return mean-
ingful information about the resource—typically, an RDF description contain-
ing everything that is known about it. Such a description usually mentions re-
lated resources by URI, which in turn can be accessed to yield their descrip-
tions. This forms a dense mesh of web-accessible resource descriptions that can
span server and organization boundaries. DBpedia resource identifiers, such as
http://dbpedia.org/resource/Busan, are set up to return RDF descriptions
when accessed by Semantic Web agents, and a simple HTML view of the same in-
formation to traditional web browsers (see Figure 3). HTTP content negotiation
is used to deliver the appropriate format.
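The content negotiation just described can be exercised with any HTTP client; the following sketch (our own illustration, using only the Python standard library) requests the same DBpedia URI once as RDF and once as HTML.

# Sketch: retrieve one DBpedia resource as RDF and as HTML purely by varying
# the Accept header; the server is assumed to answer with a redirect to the
# appropriate representation.
import urllib.request

uri = "http://dbpedia.org/resource/Busan"

for accept in ("application/rdf+xml", "text/html"):
    request = urllib.request.Request(uri, headers={"Accept": accept})
    with urllib.request.urlopen(request) as response:  # redirects are followed
        print(accept, "->", response.geturl(), response.headers.get("Content-Type"))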
Web agents that can access Linked Data include: 1. Semantic Web browsers like Disco7, Tabulator [17] (see Figure 3), or the OpenLink Data Web Browser8; 2. Semantic Web crawlers like SWSE9 and Swoogle10; 3. Semantic Web query agents like the Semantic Web Client Library11 and the SemWeb client for SWI-Prolog12.
Figure 4. Datasets that are interlinked with DBpedia.

<http://dbpedia.org/resource/Busan>
owl:sameAs <http://sws.geonames.org/1838524/> .
Agents can follow this link, retrieve RDF from the Geonames URI, and
thereby get hold of additional information about Busan as published by the
Geonames server, which again contains further links deeper into the Geonames
data. DBpedia URIs can also be used to express personal interests, places of
residence, and similar facts within personal FOAF profiles:
<http://richard.cyganiak.de/foaf.rdf#cygri>
foaf:topic_interest <http://dbpedia.org/resource/Semantic_Web> ;
foaf:based_near <http://dbpedia.org/resource/Berlin> .
Another use case is categorization of blog posts, news stories and other doc-
uments. The advantage of this approach is that all DBpedia URIs are backed
with data and thus allow clients to retrieve more information about a topic:
<http://news.cnn.com/item1143>
dc:subject <http://dbpedia.org/resource/Iraq_War> .
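Since the topic URI is dereferenceable, a client can simply load the returned description into an RDF graph, as in the following sketch (the use of rdflib is our choice for illustration; it assumes the server answers the request with RDF via content negotiation).

# Sketch: dereference a DBpedia topic URI and inspect the triples behind it.
from rdflib import Graph, URIRef

topic = URIRef("http://dbpedia.org/resource/Iraq_War")

g = Graph()
g.parse(str(topic))  # rdflib sends RDF Accept headers and follows the redirect

print(len(g), "triples describe", topic)
for _, predicate, obj in list(g.triples((topic, None, None)))[:10]:
    print(predicate, obj)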
6 User Interfaces
User interfaces for DBpedia can range from a simple table within a classic web page, through browsing interfaces, to different types of query interfaces. This section gives an overview of the different user interfaces that have been implemented so far.
6.1 Simple Integration of DBpedia Data into Web Pages
7 Related Work
Another project that works on extracting structured information from Wikipedia is the YAGO project [16]. YAGO extracts only 14 relationship types, such as subClassOf, type, familyNameOf, and locatedIn, from different sources of information in Wikipedia. One source is the Wikipedia category system (for subClassOf, locatedIn, diedInYear, bornInYear), and another is Wikipedia redirects. YAGO does not perform infobox extraction as our approach does. For determining (sub-)class relationships, YAGO does not use the full Wikipedia category hierarchy, but links leaf categories to the WordNet hierarchy.
The Semantic MediaWiki project [14,18] also aims at enabling the reuse of
information within Wikis as well as at enhancing search and browse facilities.
Semantic MediaWiki is an extension of the MediaWiki software that allows structured data to be added to wikis using a specific syntax. Ultimately, the DBpedia and Semantic MediaWiki projects have similar goals. Both want to deliver the benefits of structured information in Wikipedia to its users, but they use different approaches to achieve this aim. Semantic MediaWiki requires authors to deal with a new syntax, and covering all structured information within Wikipedia would require converting all existing information into this syntax. DBpedia exploits the structure that already exists within Wikipedia and hence does not require deep technical or methodological changes. However, DBpedia is not as tightly integrated into Wikipedia as is planned for Semantic MediaWiki and is thus limited in its ability to constrain Wikipedia authors towards syntactic and structural consistency and homogeneity.
Another interesting approach is followed by Freebase17. The project aims at building a huge online database which users can edit in a similar fashion to how they edit Wikipedia articles today. The DBpedia community cooperates with Metaweb, and we will interlink data from both sources once Freebase is public.

17 http://www.freebase.com
As future work, we will first concentrate on improving the quality of the DB-
pedia dataset. We will further automate the data extraction process in order to
increase the currency of the DBpedia dataset and synchronize it with changes
in Wikipedia. In parallel, we will keep on exploring different types of user inter-
faces and use cases for the DBpedia datasets. Within the W3C Linking Open Data community project18, we will interlink the DBpedia dataset with further datasets as they get published as Linked Data on the Web. We also plan to exploit synergies between Wikipedia versions in different languages in order to further increase DBpedia coverage and provide quality assurance tools to the Wikipedia community. Such a tool could, for instance, notify a Wikipedia author about contradictions between the content of infoboxes contained in the different language versions of an article. Interlinking DBpedia with other knowledge bases such as Cyc (and their use as background knowledge) could lead to further methods for (semi-)automatic consistency checks for Wikipedia content.

18 http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData
DBpedia is a major source of open, royalty-free data on the Web. We hope
that by interlinking DBpedia with further data sources, it could serve as a nucleus
for the emerging Web of Data.
Acknowledgments
We are grateful to the members of the growing DBpedia community, who are
actively contributing to the project. In particular we would like to thank Jörg
Schüppel and the OpenLink team around Kingsley Idehen and Orri Erling.
References
1. Karl Aberer, Philippe Cudré-Mauroux, and Manfred Hauswirth. The chatty web:
Emergent semantics through gossiping. In 12th World Wide Web Conference, 2003.
2. Sören Auer and Jens Lehmann. What have Innsbruck and Leipzig in common? Extracting semantics from wiki content. In Enrico Franconi, Michael Kifer, and
Wolfgang May, editors, ESWC, volume 4519 of Lecture Notes in Computer Science,
pages 503–517. Springer, 2007.
3. Omar Benjelloun, Anish Das Sarma, Alon Y. Halevy, and Jennifer Widom. ULDBs:
Databases with uncertainty and lineage. In VLDB, 2006.
4. Tim Berners-Lee. Linked data, 2006. http://www.w3.org/DesignIssues/
LinkedData.html.
5. Christian Bizer, Richard Cyganiak, and Tom Heath. How to publish linked
data on the web, 2007. http://sites.wiwiss.fu-berlin.de/suhl/bizer/pub/
LinkedDataTutorial/.
6. Peter Buneman, Sanjeev Khanna, and Wang Chiew Tan. Why and where: A
characterization of data provenance. In ICDT, volume 1973 of Lecture Notes in
Computer Science, 2001.
7. Christian Bizer. Quality-Driven Information Filtering in the Context of Web-Based
Information Systems. PhD thesis, Freie Universität Berlin, 2007.
8. Yingwei Cui. Lineage Tracing in Data Warehouses. PhD thesis, Stanford Univer-
sity, 2001.
9. Orri Erling and Ivan Mikhailov. RDF support in the Virtuoso DBMS. Volume P-113 of GI-Edition - Lecture Notes in Informatics (LNI), ISSN 1617-5468. Bonner Köllen Verlag, September 2007.
10. Alon Halevy, Oren Etzioni, AnHai Doan, Zachary Ives, Jayant Madhavan, and
Luke McDowell. Crossing the structure chasm. In CIDR, 2003.
11. Alon Y. Halevy, Zachary G. Ives, Dan Suciu, and Igor Tatarinov. Schema mediation
in peer data management systems. In ICDE, March 2003.
12. Zachary Ives, Nitin Khandelwal, Aneesh Kapur, and Murat Cakir. Orchestra:
Rapid, collaborative sharing of dynamic data. In CIDR, January 2005.
13. Anastasios Kementsietsidis, Marcelo Arenas, and Renée J. Miller. Mapping data in
peer-to-peer systems: Semantics and algorithmic issues. In SIGMOD, June 2003.
14. Markus Krötzsch, Denny Vrandecic, and Max Völkel. Wikipedia and the Semantic
Web - The Missing Links. In Jakob Voss and Andrew Lih, editors, Proceedings of
Wikimania 2005, Frankfurt, Germany, 2005.
15. John Miles Smith, Philip A. Bernstein, Umeshwar Dayal, Nathan Goodman, Terry
Landers, Ken W.T. Lin, and Eugene Wong. MULTIBASE – integrating hetero-
geneous distributed database systems. In Proceedings of 1981 National Computer
Conference, 1981.
16. Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. Yago: A Core of
Semantic Knowledge. In 16th international World Wide Web conference (WWW
2007), New York, NY, USA, 2007. ACM Press.
17. Tim Berners-Lee et al. Tabulator: Exploring and analyzing linked data on the se-
mantic web. In Proceedings of the 3rd International Semantic Web User Interaction
Workshop, 2006. http://swui.semanticweb.org/swui06/papers/Berners-Lee/
Berners-Lee.pdf.
18. Max Völkel, Markus Krötzsch, Denny Vrandecic, Heiko Haller, and Rudi Studer.
Semantic Wikipedia. In Les Carr, David De Roure, Arun Iyengar, Carole A. Goble,
and Michael Dahlin, editors, Proceedings of the 15th international conference on
World Wide Web, WWW 2006, pages 585–594. ACM, 2006.
19. Gio Wiederhold. Intelligent integration of information. In SIGMOD, 1993.