Provide an RDF mapping for external identifiers
Closed, Resolved. Public.

Description

We want to have Wikidata as a central part of the linked data web. This means we need to not just provide the bare identifier (m1234) but also the complete concept URI (imdb.com/m1234).

These properties would be covered: https://www.wikidata.org/wiki/Special:ListProperties/external-id

External identifiers should be treated as resource references in RDF, if we know how to construct a URI from the id in the DataValue.
Such URIs can be constructed based on a URI pattern stored in the property_info table, extracted from a Statement on the property page, just like the formatter URL.
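A minimal sketch of the expansion step, assuming the URI pattern uses the same "$1" placeholder convention as formatter URLs. The in-memory URI_PATTERNS dict and both function names are hypothetical stand-ins for the property_info lookup, not Wikibase code, and any escaping of the identifier is left out.

```python
from typing import Optional

# Hypothetical stand-in for the property_info lookup: property ID -> URI pattern.
URI_PATTERNS = {"P227": "http://d-nb.info/gnd/$1"}


def expand_identifier(uri_pattern: str, identifier: str) -> str:
    """Substitute the identifier for the "$1" placeholder in the pattern."""
    if "$1" not in uri_pattern:
        raise ValueError("URI pattern has no $1 placeholder")
    return uri_pattern.replace("$1", identifier)


def concept_uri(property_id: str, identifier: str) -> Optional[str]:
    """Return the concept URI, or None if no pattern is known for the property."""
    pattern = URI_PATTERNS.get(property_id)
    if pattern is None:
        # Without a pattern, only the plain string value can be emitted.
        return None
    return expand_identifier(pattern, identifier)
```

With the GND ID used later in this thread, concept_uri("P227", "4015139-6") gives the d-nb.info URI, while a property without a known pattern yields None and would keep only its string value.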

See also the relevant design document.

NOTE: implementation should be done the same way as normalized quantity values in RDF.

Event Timeline

Note: I44a203bddfe proposes to use the same predicate for plain string IDs and URIs. We should probably use different predicates, like we do for normalized quantities.

Yes, we can't really use the same predicate; OWL hates it. URIs and strings are different types in OWL, and a predicate is supposed to have only one type. We don't strictly have to follow OWL (we probably already have some iffy things), but we should not do things that are clearly against it.

Also, having two things under the same predicate complicates querying: you start to get duplicates that you need to filter manually, which is both annoying to the user and bad for query performance.
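To make the duplicate problem concrete, here is a small sketch over in-memory triples. The tuple representation, the helper names, and the "direct-resource:" prefix are placeholders for illustration only; no actual predicate name had been decided at this point.

```python
# Variant A: string and URI share one predicate.
TRIPLES_SHARED = [
    ("wd:Q2", "wdt:P227", '"4015139-6"'),
    ("wd:Q2", "wdt:P227", "<http://d-nb.info/gnd/4015139-6>"),
]

# Variant B: the URI gets its own (placeholder) predicate.
TRIPLES_SPLIT = [
    ("wd:Q2", "wdt:P227", '"4015139-6"'),
    ("wd:Q2", "direct-resource:P227", "<http://d-nb.info/gnd/4015139-6>"),
]


def objects(triples, subject, predicate):
    """Return all objects matching a subject/predicate pair."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def is_literal(term):
    """Crude check: in this toy encoding, literals start with a double quote."""
    return term.startswith('"')


# Variant A returns two rows for one statement, so callers must filter by type.
shared = objects(TRIPLES_SHARED, "wd:Q2", "wdt:P227")
shared_strings = [o for o in shared if is_literal(o)]

# Variant B returns exactly one row per predicate, with no extra filtering.
split_strings = objects(TRIPLES_SPLIT, "wd:Q2", "wdt:P227")
split_uris = objects(TRIPLES_SPLIT, "wd:Q2", "direct-resource:P227")
```

The extra is_literal pass in variant A is the manual filtering described above; every query over the shared predicate would need an equivalent step.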

So let's just define another predicate and use it. Should be simple enough. Something like ...prop/direct-resource/ perhaps? Or just ...prop/resource/ ?

The toolserver service that currently does some of the work should probably be replaced as well.

I'm currently considering two ways to model this:

  • add <http://www.wikidata.org/prop/direct-normalized/> as a predicate for direct statements, and use the existing "normalized" predicates for full statements: example
  • add four new predicates, <http://www.wikidata.org/prop/direct-resource/>, <http://www.wikidata.org/prop/statement-resource/>, <http://www.wikidata.org/prop/qualifier-resource/>, and <http://www.wikidata.org/prop/reference-resource/>: example

(see also: diff between examples)

Note: we already have psn:, prn: and pqn: used for full normalized values. So if we want prefixes for short normalized values, we need new ones.

Then: I much prefer a generic "normalized value" predicate to a specific "resource" predicate. However, the two alternatives above are not really alternatives but rather complementary: we need predicates to express the normalized value both in the context of a "truthy" statement and in every context where a simple value can occur. Unfortunately, that may mean 4 new prefixes, but I currently see no way around it.

Change 258523 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Collect canonical URI patterns from statements on properties.

https://gerrit.wikimedia.org/r/258523

Change 377765 had a related patch set uploaded (by Thiemo Mättig (WMDE); owner: Thiemo Mättig (WMDE)):
[mediawiki/extensions/Wikibase@master] Replace $this->getSettings() with $this->settings

https://gerrit.wikimedia.org/r/377765

It looks like change 258519 implements the first of Daniel’s suggestions (add wdtn: and then reuse psn:, pqn:, prn:). But as Stas already pointed out, that’s problematic because it mixes up predicates for simple values (ps:) and for full values (psv:). Previously, psn: was the normalized variant of psv: (containing normalized full quantity values), but now it’s also the normalized variant of ps: (containing normalized simple external identifier URIs). I’m not sure if this is still something that OWL hates (technically, they’re both IRIs), but it’s inconsistent and confusing.

To clarify – we would have something like this:

wd:Q2 wdt:P2067 5972.37; # direct statement – simple value
      p:P2067 [ # full statement:
        ps:P2067 5972.37; # simple value
        psv:P2067 [ # full value
          wikibase:quantityAmount 5972.37;
          wikibase:quantityUnit wd:Q613726
        ];
        psn:P2067 [ # normalized full value
          wikibase:quantityAmount 5972370000000000000000000.0;
          wikibase:quantityUnit wd:Q11570
        ]
      ];
      wdt:P227 "4015139-6"; # direct statement – simple value
      p:P227 [ # full statement
        ps:P227 "4015139-6"; # simple value
        psn:P227 <http://d-nb.info/gnd/4015139-6> # normalized simple value
      ].

The quantity-type statement (P2067, mass) has both a ps: triple for the simple value (5972.37 – the yottagrams are elided) and a psv: triple for the full value (5972.37 yottagrams), and the psn: triple is a normalized variant of the full value psv: triple, also pointing to a full value node. But the external-identifier-type statement (P227, GND ID) only has a ps: triple for the simple value, and the psn: triple would be a normalized variant of that simple value ps: triple, also pointing to a simple value. That gives two different interpretations to the psn: prefix.

So the second suggestion is probably preferable: add four new predicates (direct, statement, qualifier, reference). We can use the URI suffix -resource or perhaps the shorter -uri, yielding the single-letter prefixes wdtu:, psu:, pqu:, and pru:. Then psn: would still always contain full value nodes and psu: would always contain simple values, and there would be two possible variations of ps:, namely psn: for normalized quantities and psu: for normalized external identifiers. (Perhaps we could generalize the meanings to “normalized full values” and “normalized simple values” later… though if we want to do that, we should reconsider the -resource or -uri URI suffix and prefix abbreviation.)

The result would be like the above snippet, except with psu:P227 instead of psn:P227. There would be no psn:P227 because external identifiers don’t have full value nodes. With the generalized interpretation of psu:, we could add psu:P2067 5972370000000000000000000.0 to the mass statement (simple normalized value), but I don’t think that’s necessary.

(I’m just talking about ps:, psv:, psn: and psu: in this comment, but all this applies analogously to pq: and pr:, of course.)
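For comparison, a rough sketch that renders the external identifier statement under both naming options. The function and the "reuse"/"separate" option labels are made up for this illustration; only the prefixes (wdtn:, psn:, wdtu:, psu:) come from the discussion above, and the blank-node layout simply mirrors the snippet.

```python
# Prefix choice per option; the option labels are hypothetical.
PREFIXES = {
    "reuse": {"direct": "wdtn", "statement": "psn"},     # first suggestion
    "separate": {"direct": "wdtu", "statement": "psu"},  # second suggestion
}


def external_id_turtle(subject, prop, value, uri, option):
    """Render one external identifier statement as Turtle text."""
    names = PREFIXES[option]
    lines = [
        f'{subject} wdt:{prop} "{value}";',              # direct, simple value
        f'      {names["direct"]}:{prop} <{uri}>;',      # direct, expanded URI
        f'      p:{prop} [',                             # full statement
        f'        ps:{prop} "{value}";',                 # simple value
        f'        {names["statement"]}:{prop} <{uri}>',  # expanded URI
        '      ].',
    ]
    return "\n".join(lines)


reuse = external_id_turtle(
    "wd:Q2", "P227", "4015139-6", "http://d-nb.info/gnd/4015139-6", "reuse"
)
separate = external_id_turtle(
    "wd:Q2", "P227", "4015139-6", "http://d-nb.info/gnd/4015139-6", "separate"
)
```

The two outputs differ only in the prefix names, which is why the choice between them is about semantics and predicate count rather than about the data emitted.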

I think this use of psn::

p:P227 [ # full statement
   ps:P227 "4015139-6"; # simple value
   psn:P227 <http://d-nb.info/gnd/4015139-6> # normalized simple value
 ].

is OK. I'll take time to review it more thoroughly in coming days, but on the face of it it looks OK. Also, please note that psn:P123 and psn:P345 do not have to be of the same type - you have to preserve consistency within the same predicate, but different predicates with the same prefix can have different types. In this case, they even happen to have the same type, due to how we represent values, but in general that's not a requirement as long as overall semantics is close and the type is properly defined in the RDF.

So I'd probably be happier with reusing psn: ones and adding wdtn: or whatever it would be for direct statements.

Change 378024 had a related patch set uploaded (by Thiemo Mättig (WMDE); owner: Thiemo Mättig (WMDE)):
[mediawiki/extensions/Wikibase@master] Update formatterUrlProperty/canonicalUriProperty docs

https://gerrit.wikimedia.org/r/378024

Change 378024 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Update formatterUrlProperty/canonicalUriProperty docs

https://gerrit.wikimedia.org/r/378024

@Lucas_Werkmeister_WMDE, @Smalyshev: can you guys agree on something and write it down, so @Ladsgroup can implement it? I don't know which way is actually the easiest to use or the most semantically sound in RDF, and I'm busy with the db schema for MCR.

I tried to implement the second suggestion made by Daniel, as @Lucas_Werkmeister_WMDE said it's better to go in this direction, but I'm not sure if I did it correctly. Please take a look at the patch.

My current opinion on the matter is that it is better to try and reuse psn, pqn and prn while adding one more value - wdtn or alike - for direct statements. @Lucas_Werkmeister_WMDE, if you have arguments to the contrary please state them, I could be convinced otherwise. It's not a huge issue - it can work either way - but proliferation of different predicates beyond need has both performance implications and architectural clarity implications, so I'd like to keep it under control. That said, there may be a good argument for adding these four. If we add them, though, I'd probably like them to be more generic than "resource" - it's not exactly clear what that means. I'll think a bit more on it and update here.

There is a danger: if wdtn is used for a type whose normalized value is not a resource (URL), then different wdtn predicates would have different target types. But we already have the same situation with wdt, and it is OK; we will just need a bit more logic when we generate the predicate description, which is not too hard, I think. In any case, it is not an issue right now.

The rest of the patch looks mostly OK, except for the notes which I made on gerrit.

Copying Daniel’s comment from Gerrit:

The normalized value of an identifier is not the same as its URI expansion. E.g. ISBNs have a normalized form and a URN, they are not the same.

I think this means that for an ISBN-10 052122151x, the normalized form is 0-521-22151-X while the URN/URI is urn:isbn:0-521-22151-X. But do we have plans to do such external identifier normalization in Wikibase itself? I was under the impression that we expect them to be saved in normalized form in Wikidata itself (for instance, many “external identifier” properties have format constraints to ensure this).

I’ve tried to summarize the two options (without any pro/contra arguments) on-wiki (just for the syntax highlighting): https://www.wikidata.org/wiki/User:Lucas_Werkmeister_(WMDE)/External_identifiers

Yea, ISBN is probably not a good example, for two reasons: 1) we usually do syntactic normalization when processing input already, so no need for a separate normalized value in RDF and 2) the normalization rule for ISBNs would be bound to the property, not the data type. We don't support this at the moment, it's on the "would be nice" list.

So, forget about ISBNs. My point was: expanding IDs to resource references is not really the same thing as normalizing values. Conflating them makes my semantic pinky tingle. For one thing, it feels like normalized values should keep the same type. This is not the case when expanding external identifiers.
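The type argument can be sketched as follows: normalization maps a literal to a literal, while expansion maps a literal to a URI. The Literal and Uri wrapper classes and both function names are hypothetical. Note also that this sketch normalizes to a compact form without hyphens, which is an assumption made here for simplicity; the hyphenated form mentioned earlier requires ISBN range tables that are out of scope.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Literal:
    value: str


@dataclass(frozen=True)
class Uri:
    value: str


def normalize_isbn10(raw: Literal) -> Literal:
    """Syntactic normalization: the result has the same type as the input."""
    compact = raw.value.replace("-", "").replace(" ", "").upper()
    body, check = compact[:9], compact[9:]
    if len(compact) != 10 or any(c not in "0123456789" for c in body):
        raise ValueError("not an ISBN-10")
    if check not in tuple("0123456789X"):
        raise ValueError("invalid check character")
    # ISBN-10 checksum: weights 10..1, with X counting as 10, must be 0 mod 11.
    digits = [int(c) for c in body] + [10 if check == "X" else int(check)]
    if sum(w * d for w, d in zip(range(10, 0, -1), digits)) % 11 != 0:
        raise ValueError("bad check digit")
    return Literal(compact)


def expand_isbn_to_urn(normalized: Literal) -> Uri:
    """URI expansion: the result is a resource reference, not a string."""
    return Uri("urn:isbn:" + normalized.value)
```

Here normalize_isbn10 never leaves the Literal type, whereas expand_isbn_to_urn always changes it, which is the asymmetry the comment above is pointing at.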

But what I'm expressing here is a feeling of unease, not a blocker or requirement. I'm not an expert on RDF modeling. I wish we had one around ;)

My point was: expanding IDs to resource references is not really the same thing as normalizing values

True. The question is: would there be a type where we want to do both? I.e. provide both a normalized value (whatever it is) and a secondary/extended URI/URL (whatever it is)? It looks like we currently do not have such a type. If we foresee that this can happen, we want separate prefixes. If we don't foresee such a type happening (at least in the form of a normalized/URI combo), we can use the same prefix for both.

For one thing, it feels like normalized values should keep the same type.

That depends on normalization, I'd guess. But that's a good point. In any case, how do we make a decision here?

Change 377765 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Replace $this->getSettings() with $this->settings

https://gerrit.wikimedia.org/r/377765

My point was: expanding IDs to resource references is not really the same thing as normalizing values

That depends on normalization, I'd guess. But that's a good point. In any case, how do we make a decision here?

We asked @Denny and @mkroetzsch, and they agree with each other that we don't need to distinguish. @Lucas_Werkmeister_WMDE is also happy with just one prefix, and you also seem to be arguing for that solution. I was never opposed to it, just wary of unforeseen consequences. What if we do need both expansion and normalization for some data type eventually?

But the RDF experts agree, and YAGNI applies, so the decision is made: treat URI expansion like value normalization. No need for separate prefixes.

OK, so let's switch the patch to use psn and wdtn (new) prefixes and otherwise it seems to be pretty close to being ready.

Smalyshev moved this task from Doing to Waiting/Blocked on the User-Smalyshev board.

Change 382445 had a related patch set uploaded (by Lucas Werkmeister (WMDE); owner: Lucas Werkmeister (WMDE)):
[mediawiki/extensions/Wikibase@master] Fix writing normalized property predicates

https://gerrit.wikimedia.org/r/382445

Change 382446 had a related patch set uploaded (by Lucas Werkmeister (WMDE); owner: Lucas Werkmeister (WMDE)):
[mediawiki/extensions/Wikibase@master] Add external identifier prop stmt to RDF tests

https://gerrit.wikimedia.org/r/382446

Change 382473 had a related patch set uploaded (by Thiemo Mättig (WMDE); owner: Thiemo Mättig (WMDE)):
[mediawiki/extensions/Wikibase@master] Refactor PropertyRdfBuilder logic for readability

https://gerrit.wikimedia.org/r/382473

FYI please see also T177539, "Provide an RDF mapping for external identifiers with third party URIs"

Change 258519 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] RDF mapping for external identifiers

https://gerrit.wikimedia.org/r/258519

Change 382445 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Fix writing normalized property predicates

https://gerrit.wikimedia.org/r/382445

Change 382473 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Refactor PropertyRdfBuilder logic for readability

https://gerrit.wikimedia.org/r/382473

Change 382446 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Add external identifier prop stmt to RDF tests

https://gerrit.wikimedia.org/r/382446

As far as I can tell, all the related changes are merged. Is anything left to do here?

@Smalyshev will this require a reload of the query servers?

There are two ways to get data there:

  1. Reload (see T176593)
  2. Extract new data from the dump and load it directly.

The second one is possible in theory. In practice, since we still don't have a full NT dump and the TTL dump is not readily dissectable, it is not easy to do. We'd probably have to use Wikidata Toolkit or implement a new tool to do it. So maybe it's quicker to just bite the bullet and do the full reload.
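For what it's worth, if a line-oriented N-Triples dump were available, the extraction in option 2 could look roughly like the sketch below. It is not an existing tool: the function name is invented, the namespace list is a parameter because only the direct-normalized namespace is spelled out in this task, and the simple whitespace split relies on N-Triples keeping subject and predicate free of spaces.

```python
from typing import Iterable, Iterator, Sequence

# Only this namespace is quoted in the task; others would be passed in as needed.
DIRECT_NORMALIZED = "http://www.wikidata.org/prop/direct-normalized/"


def filter_by_predicate_namespace(
    lines: Iterable[str], namespaces: Sequence[str]
) -> Iterator[str]:
    """Yield the N-Triples lines whose predicate IRI starts with a namespace."""
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comments
        parts = stripped.split(None, 2)  # subject, predicate, rest of line
        if len(parts) < 3:
            continue
        predicate = parts[1]
        if predicate.startswith("<") and any(
            predicate[1:].startswith(ns) for ns in namespaces
        ):
            yield stripped


SAMPLE = [
    '<http://www.wikidata.org/entity/Q2> <http://www.wikidata.org/prop/direct/P227> "4015139-6" .',
    "<http://www.wikidata.org/entity/Q2> <http://www.wikidata.org/prop/direct-normalized/P227> <http://d-nb.info/gnd/4015139-6> .",
    "# a comment line",
]
new_triples = list(filter_by_predicate_namespace(SAMPLE, [DIRECT_NORMALIZED]))
```

The same filter does not work on the TTL dump, where one statement spans several lines and predicates appear in prefixed form, which matches the remark above that the TTL dump is not readily dissectable.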

One thing I’ve just realized (not sure if it was discussed somewhere before): editing a “URI used in RDF” statement on a property invalidates all wdtn:/psn:/pqn:/prn: triples for that property, across all entities. I can’t think of any other kind of edit that would currently cause such a change immediately (since unit conversions aren’t directly based on the live data). Do we have any architecture to deal with this (purge affected Special:EntityData/*.ttl pages from Varnish, reload affected entities in the query service, possibly more I can’t think of right now)?

And is this a big problem? After all, I suppose these URIs should be fairly stable.
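A toy illustration of the fan-out described above: one edit to a property's URI pattern touches every entity that uses the property. The in-memory mapping and function names are purely hypothetical; a real implementation would need some usage index rather than a dict.

```python
from typing import Dict, List, Set

# Prefixes named above whose triples depend on the property's URI pattern.
NORMALIZED_PREFIXES = ("wdtn", "psn", "pqn", "prn")

# Hypothetical usage data: entity ID -> properties used on that entity.
PROPERTIES_BY_ENTITY: Dict[str, Set[str]] = {
    "Q2": {"P227", "P2067"},
    "Q42": {"P227"},
    "Q64": {"P2067"},
}


def stale_predicates(edited_property: str) -> List[str]:
    """Predicates whose objects are derived from the edited URI pattern."""
    return [f"{prefix}:{edited_property}" for prefix in NORMALIZED_PREFIXES]


def entities_needing_refresh(
    properties_by_entity: Dict[str, Set[str]], edited_property: str
) -> List[str]:
    """Entities whose RDF output would have to be purged or reloaded."""
    return sorted(
        entity_id
        for entity_id, props in properties_by_entity.items()
        if edited_property in props
    )


affected = entities_needing_refresh(PROPERTIES_BY_ENTITY, "P227")
```

For a widely used identifier property, the affected list would cover a large share of all entities, which is what makes purging or reloading on every pattern edit expensive.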

Do we have any architecture to deal with this (purge affected Special:EntityData/*.ttl pages from Varnish

No, we do not. In fact, I'm not sure we even have one to deal with simpler cases. See e.g. T128667.

And is this a big problem?

Not sure. Hopefully not too big, WDQS avoids cache anyway (this being one of the reasons why) and we won't change these URIs too much, hopefully.

In order to update that, we need to run the rebuildPropertyInfo script, I guess.

In order to update that, we need to run the rebuildPropertyInfo script, I guess.

That should in fact happen on edit.

There are no short/mid term plans to purge the cache in these cases… see also T112081: [Story] purge cached renderings of IDs when the formatter URL changes. If such a URL changes, this would show up in the next dump (this is not cached) and on Special:EntityData (about 24h after the change). But the query service might have the old one in certain places for quite a while :/

Ah, I didn’t think about formatter URLs, which pose the same problem for HTML renderings. Good point.