RDF metadata

From Meta, a Wikimedia project coordination wiki

So, I just checked in code to add RDF metadata to MediaWiki. I figured I'd post here to give the whys and hows.

See also: RDF, a project for providing extensive and flexible RDF support for MediaWiki.

What

The software generates RDF metadata dynamically, according to this recommendation (largely written by me, BTW):

  http://www.communitywiki.org/cw/DublinCoreForWiki

It creates RDF in two formats. One is the simple Dublin Core RDF/XML recommendation:

  http://dublincore.org/documents/dcmes-xml/

The other is the Creative Commons metadata format (slightly and annoyingly dissimilar):

  http://creativecommons.org/technology/metadata/implement

But

"But metadata is stupid! And so's RDF! Gar gar gar!"

That's fine. All the code is disabled by default, and the only overhead is checking two global boolean variables. Presence in the main codebase is minimal; most of the RDF stuff is in a separate file, Metadata.php, which isn't even loaded by default.

Why

There are two reasons I did this. First is that Creative Commons RDF metadata is used by various CC search engines and other tools to find and index Creative Commons-licensed info. So for Wikitravel, this is quite a boon.

The second reason is to allow bots and spiders to discover authorship and history info without having to, say, have access to the database or parse the history page output.

How

When the global metadata flags are set ($wgEnableDublinCoreRdf and $wgEnableCreativeCommonsRdf, respectively), each page is displayed with a <link rel='meta'> tag, with the URI of the wiki script, plus an appropriate action. The actions are 'dublincore' and 'creativecommons'. So the extra stuff looks like this:

   <link rel='meta' type='application/rdf+xml' 
         href='wiki.phtml?title={title}&action=dublincore' />

I added a helper method to OutputPage to make this microscopically easier.
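As a rough illustration of the link-tag step, here is a minimal Python sketch (not the PHP helper actually added to OutputPage). The function name `metadata_links` and its parameters are hypothetical; only the two action names, the `wiki.phtml` script name, and the tag layout come from the text above.

```python
from html import escape
from urllib.parse import urlencode


def metadata_links(title, enable_dublin_core, enable_creative_commons,
                   script="wiki.phtml"):
    """Return the <link rel='meta'> tags for one page.

    Hypothetical helper: the two booleans stand in for
    $wgEnableDublinCoreRdf and $wgEnableCreativeCommonsRdf.
    """
    actions = []
    if enable_dublin_core:
        actions.append("dublincore")
    if enable_creative_commons:
        actions.append("creativecommons")

    tags = []
    for action in actions:
        # Build "script?title=...&action=...", then escape it for use
        # inside an HTML attribute (this turns "&" into "&amp;").
        href = f"{script}?{urlencode({'title': title, 'action': action})}"
        tags.append(
            "<link rel='meta' type='application/rdf+xml' "
            f"href='{escape(href, quote=True)}' />"
        )
    return tags
```

With both flags off the function returns an empty list, which mirrors the point that a wiki with the feature disabled emits nothing extra.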

I used linked RDF URIs rather than including the metadata directly to minimize performance impact and DB hits. Since only a few processors are going to use the metadata, it doesn't make sense to output it on every page.

When the 'dublincore' or 'creativecommons' action comes in, if the global enabler flags aren't set, an HTTP error message is returned. (I added a function to GlobalFunctions.php to do this -- I figured it would be useful for other formats, too.) If they're enabled, it loads Metadata.php and calls the appropriate function.
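The request handling described above can be sketched roughly as follows. This is an illustrative Python version, not the MediaWiki code: `dispatch_metadata` is a hypothetical name, and the 403 and 404 status codes are assumptions, since the text only says that an HTTP error is returned.

```python
from typing import Callable, Dict, Tuple


def dispatch_metadata(action: str,
                      enabled: Dict[str, bool],
                      generators: Dict[str, Callable[[], str]]) -> Tuple[int, str]:
    """Return (HTTP status, body) for a metadata action.

    `enabled` maps an action name to its global flag; `generators` maps an
    action name to a function that builds the RDF/XML. Both are stand-ins
    for the global flags and the functions in Metadata.php.
    """
    if action not in generators:
        return 404, f"Unknown action: {action}"  # assumed status code
    if not enabled.get(action, False):
        return 403, f"{action} metadata is disabled on this wiki"  # assumed
    # The generator is only called once the flag check has passed, which
    # mirrors loading Metadata.php lazily.
    return 200, generators[action]()
```

Keeping the generators behind callables means nothing RDF-related runs unless the flag is set and the action is actually requested.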

The functions in Metadata.php generate RDF/XML streams, according to the requested format.
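For readers unfamiliar with the output, a simple Dublin Core RDF/XML document can be produced along these lines. This is a Python sketch using the standard library, not the code in Metadata.php; the function name `dublin_core_rdf` and the choice of fields passed in are assumptions. The two namespace URIs are the standard RDF and Dublin Core element-set namespaces.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"


def dublin_core_rdf(page_url, fields):
    """Serialize simple Dublin Core fields for one page as RDF/XML.

    `fields` maps Dublin Core element names (for example "title" or
    "creator") to string values. Hypothetical helper for illustration.
    """
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("dc", DC_NS)

    root = ET.Element(f"{{{RDF_NS}}}RDF")
    # One rdf:Description describing the page identified by rdf:about.
    desc = ET.SubElement(root, f"{{{RDF_NS}}}Description",
                         {f"{{{RDF_NS}}}about": page_url})
    for name, value in fields.items():
        ET.SubElement(desc, f"{{{DC_NS}}}{name}").text = value
    return ET.tostring(root, encoding="unicode")
```

A call such as `dublin_core_rdf("http://example.org/wiki/Main_Page", {"title": "Main Page"})` yields a single `rdf:Description` element with one `dc:title` child.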

Also

Different browsers and processors expect different MIME types for RDF: application/rdf+xml (the recommended one), text/rdf (deprecated), text/xml, application/xml. Instead of trying to force this, I wrote a quick bit of content-negotiation code, which is in GlobalFunctions.php. It parses the "Accept:" header, and a similar string for determining the server's preferences, and picks a MIME type based on those preferences.

The server preferences for RDF, for example, are:

   application/rdf+xml,text/xml;q=0.7,application/xml;q=0.5,text/rdf;q=0.1

I kinda figured this was a useful function to have; it could also be applied to the work going on to serve XHTML, e.g.:

   application/xhtml+xml,text/xml;q=0.7,text/html;q=0.5

Or, say, for future support of SVG.
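The negotiation described in this section can be sketched in Python as follows. This is not the function in GlobalFunctions.php: the names `parse_accept` and `negotiate` are hypothetical, and scoring each type by the product of the server and client q-values is an assumption, since the text only says a type is picked "based on those preferences".

```python
def parse_accept(header):
    """Parse 'type/sub;q=0.5, ...' into a {mime_type: q} dict."""
    prefs = {}
    for part in header.split(","):
        pieces = [p.strip() for p in part.split(";")]
        mime = pieces[0].lower()
        if not mime:
            continue
        q = 1.0  # a missing q parameter means full preference
        for param in pieces[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed q as "not acceptable"
        prefs[mime] = q
    return prefs


def _client_q(mime, client):
    """Client quality for a concrete type, honouring type/* and */*."""
    if mime in client:
        return client[mime]
    major_wildcard = mime.split("/")[0] + "/*"
    if major_wildcard in client:
        return client[major_wildcard]
    return client.get("*/*", 0.0)


def negotiate(server_header, accept_header):
    """Pick the server type with the best combined score, or None."""
    server = parse_accept(server_header)
    client = parse_accept(accept_header)
    best, best_score = None, 0.0
    for mime, server_q in server.items():
        score = server_q * _client_q(mime, client)
        if score > best_score:  # ties keep the earlier server entry
            best, best_score = mime, score
    return best


RDF_PREFS = "application/rdf+xml,text/xml;q=0.7,application/xml;q=0.5,text/rdf;q=0.1"
```

Under this scoring, a client sending `Accept: text/xml,application/xml;q=0.9` against `RDF_PREFS` gets `text/xml` (0.7 beats 0.45), and a client that accepts none of the server's types gets `None`, at which point the caller can fall back to a default or an error.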

Future

I'd like to enhance the code to include Qualified Dublin Core info like history and links data, per the DublinCoreForWiki link above.

I'm still working on getting field-value pairs working; these may be useful for providing user-edited metadata info, like abstracts, keywords, or authors and sources external to the wiki.

I'd also like to provide descriptions for images and other files; that's gonna take a little more work.


Feedback

creator != last editor

The use of the last editor as creator is about the opposite of what is needed for copyright purposes. The original editor is the creator and all later versions are generally derivative works unless they are a complete rewrite. In a copyleft situation, the last editor has a license from the previous editor and so on all the way back to the original creator. Jamesday 12:25, 1 Jun 2004 (UTC)

The last editor is the creator of the work as it stands. --Evan 22:59, 2 Jun 2004 (UTC)

Examples?

It isn't clear if any of the servers are running this now. A pointer to an example of this would be really handy. --Nealmcb 22:22, 4 Jun 2004 (UTC)

This is working now, just check the HTML code of any page, line 8. -- Stw 20:25, 15 Jul 2004 (UTC)


  • How do I actually get to see the RDF data? I tried dropping it in my browser without luck. I also tried to curl it but don't know what parameters to include. -- 18:36, 16 Sep 2004 (UTC)
Add the following at the end of the URL of a wiki page where you want to see the metadata:
For Creative Commons: &action=creativecommons
For Dublin Core: &action=dublincore
(If you have not activated the RDF metadata, you get an empty page displaying a short message.) --xephor 11:42, 22 September 2005 (UTC)
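For anyone scripting this, the URL can be built along these lines. This is a small Python sketch; the function name `metadata_url` is hypothetical, and it assumes the page is addressed in the `index.php?title=...` query-string form rather than a pretty URL.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def metadata_url(page_url, fmt):
    """Return page_url with its action set to the requested RDF format."""
    if fmt not in ("dublincore", "creativecommons"):
        raise ValueError(f"unknown metadata format: {fmt}")
    parts = urlsplit(page_url)
    # Drop any existing action (for example action=edit), keep the rest.
    query = [(key, value)
             for key, value in parse_qsl(parts.query, keep_blank_values=True)
             if key != "action"]
    query.append(("action", fmt))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

The resulting URL can then be fetched with curl or any HTTP client, ideally with an `Accept: application/rdf+xml` header.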

Just the beginning?

The presentation of "use-cases" below for RDF and Wikimedia was written without knowledge of the ongoing effort in this area (I should have known that this idea must have come to other minds before :-). For further (and better-structured) information, see Semantic MediaWiki.

While metadata for documents, such as Dublin Core, seems to me a very important issue, the idea of the semantic web is more than just that. It is about providing ALL information in a form that can be easily read and transformed by machines. Applications querying and reasoning about such data ensure that it is also "understood" by machines. Rewriting Wikipedia articles in such a machine-processable format with well-defined vocabularies and ontologies would open great new perspectives both for the maintenance of the vast encyclopedia that Wikipedia has become, and for the usage of Wikipedia by third parties:

  • Reasoning about the accuracy of Wikipedia content: Since a large fraction of Wikipedia articles are about persons, it would make sense to introduce a vocabulary about when these persons lived, where they were born, about their relatives, etc. A Wikipedia article about John F Kennedy could contain the following triples (in no specific syntax):

<www.wikipedia.org/Persons/John_F_Kennedy> <http://some_domain/rdfvocabulary/born> "May 29, 1917"

<www.wikipedia.org/Persons/John_F_Kennedy> <http://some_domain/rdfvocabulary/died> "November 22, 1963"

<www.wikipedia.org/Persons/John_F_Kennedy> <http://some_domain/rdfvocabulary/mother> <www.wikipedia.org/Persons/Rose_Fitzgerald>

<www.wikipedia.org/Persons/John_F_Kennedy> <http://some_domain/rdfvocabulary/father> <www.wikipedia.org/Persons/Joseph_P_Kennedy_Sr>

Provided that such data were also available for Rose Fitzgerald and Joseph Kennedy, an automatic reasoner could check the consistency of the birth dates (his parents should be at least 15 years older than him :-). Furthermore, a pedigree could be built automatically from this data and presented on a separate page. Finally, it could be checked that JFK's siblings have the same parents. I am sure that there are many more, and even more beneficial, applications of an RDF version of Wikipedia articles, which would help to improve the correctness of Wikipedia information and might also suggest derived information to include. Of course, RDF versions do not need to be translated into other languages, which suggests treating RDF as just another translation, like German or French.
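The birth-date check mentioned above could look roughly like this. It is a Python sketch over plain tuples rather than a real RDF store; the function names are hypothetical, the predicate names follow the example triples, and the parents' birth dates are not part of the original text and were added only to make the example runnable.

```python
from datetime import date

# (subject, predicate, object) triples. JFK's dates come from the text
# above; the parents' birth dates are added here for illustration.
TRIPLES = [
    ("John_F_Kennedy", "born", date(1917, 5, 29)),
    ("John_F_Kennedy", "died", date(1963, 11, 22)),
    ("John_F_Kennedy", "mother", "Rose_Fitzgerald"),
    ("John_F_Kennedy", "father", "Joseph_P_Kennedy_Sr"),
    ("Rose_Fitzgerald", "born", date(1890, 7, 22)),
    ("Joseph_P_Kennedy_Sr", "born", date(1888, 9, 6)),
]


def years_between(earlier, later):
    """Whole years elapsed from `earlier` to `later`."""
    before_anniversary = (later.month, later.day) < (earlier.month, earlier.day)
    return later.year - earlier.year - before_anniversary


def parent_age_violations(triples, min_gap_years=15):
    """List (child, relation, parent, gap) where the parent is too young."""
    born = {s: o for s, p, o in triples if p == "born"}
    problems = []
    for child, relation, parent in triples:
        if relation not in ("mother", "father"):
            continue
        if child not in born or parent not in born:
            continue  # cannot check without both birth dates
        gap = years_between(born[parent], born[child])
        if gap < min_gap_years:
            problems.append((child, relation, parent, gap))
    return problems
```

For the triples above the check finds nothing to report; a record where a parent is born only a few years before, or even after, the child would be flagged.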

  • Another example would be to include geographical information for cities, countries, states, rivers, borders, highways and other transportation. Again, inconsistencies could be easily discovered and fixed, and "aggregation pages" could list all cities within a state or calculate the distance between cities and borders, etc. The more vocabularies (about persons, about geography, etc.) are introduced and used on Wikipedia, the more interdependent reasoning using more than one vocabulary becomes possible. The value of Wikipedia RDF content would increase exponentially with its amount.

Both of the above examples focus more on the benefits that would emerge for Wikipedia "house-keeping" and maintenance. An even more compelling reason to introduce machine-processable information is that users would have entirely new query possibilities, resembling those that we know from relational database query languages. Queries such as "give me a list of all of JFK's ancestors" or "which countries share a border with Poland" could be easily formulated and answered. The nice thing about using RDF (as a W3C standard) for the representation of such data is that neither the query functionality nor the reasoning about the correctness would have to be integrated in Wikimedia, but could be provided by third parties, in a similar way that Wikipedia bots already correct spelling mistakes today.
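As an illustration of the ancestor query, here is a short Python sketch that walks "mother" and "father" triples transitively. It uses plain tuples instead of an RDF query language; the function name `ancestors` is hypothetical, and the two grandparent triples are not in the original text and were added so that the walk has more than one level.

```python
# Parent triples: the first two come from the example above, the two
# grandparent entries are added here for illustration.
PARENT_TRIPLES = [
    ("John_F_Kennedy", "mother", "Rose_Fitzgerald"),
    ("John_F_Kennedy", "father", "Joseph_P_Kennedy_Sr"),
    ("Rose_Fitzgerald", "father", "John_F_Fitzgerald"),
    ("Joseph_P_Kennedy_Sr", "father", "Patrick_J_Kennedy"),
]


def ancestors(person, triples):
    """Return the set of all known ancestors of `person`."""
    parents = {}
    for subject, predicate, obj in triples:
        if predicate in ("mother", "father"):
            parents.setdefault(subject, []).append(obj)

    found = set()
    stack = list(parents.get(person, []))
    while stack:
        current = stack.pop()
        if current in found:
            continue  # already visited; also guards against cyclic data
        found.add(current)
        stack.extend(parents.get(current, []))
    return found
```

Because the walk only needs the triples, it could run in a third-party tool without any change to the wiki software, which is the point made above.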

Finally, RDF and the wiki principle are a perfect match. The take-off of the semantic web is slowed down by the need for trust. Anybody can write information on the web, but it is hard to see which information is indeed correct. Wikipedia has shown that the collaborative editing of articles leads to better quality and less disinformation. (A sign of the reliability of Wikipedia is that its articles very often appear as the first search result in Google.) Therefore, RDF data in Wikipedia would also be more trustworthy than on little-known web pages. The second wiki characteristic that makes it suitable for RDF is that anybody can easily contribute. Since the dawn of the WikiWikiWeb, contributing information has been almost as easy as getting it.

Of course, there are also drawbacks to be considered:

  • The availability of metadata could greatly increase server load. Therefore a possibility to disable RDF content should be provided.
  • Users may not want to contribute RDF data because of its complexity. This is certainly a valid objection, and I do assume that only a fraction of the community will contribute in this "language". However, this argument is weakened by the fact that all "language communities" will work together on the same RDF version, and also that for the technically advanced, contributing in RDF will be much easier, since it is easier to author (spelling, grammar, and style are less of an issue).
  • A final objection is that Wikipedia is an encyclopedia, not a database. This is true, but as has been shown above, machine-processable information also helps to increase the value of Wikipedia as an encyclopedia. Moreover, we should not let tradition and customs from the "pre-wiki" era keep us from exploring new possibilities, as long as there is nothing to lose.

Benedikt

Creative Commons

Even the CVS version doesn't seem to know about the CC 2.5 licences. It would be good if that could be fixed - I know it's trivial, but a CVS commit would mean one less hack for me to do myself every time the MW version changes. The following changes in getKnownLicenses seemed to work for me:

$ccVersions = array('1.0', '2.0', '2.5');
if( $version != '1.0' && substr( $license, 0, 2) != 'by' ) {
--Kingboyk 22:09, 19 January 2012 (UTC)