
Expose constraint violations to WDQS
Closed, Resolved | Public | 8 Estimated Story Points

Description

We could expose constraint violations to Wikidata-Query-Service so that they can be queried.
For this we need an interface that lets Wikibase-Quality write to Wikidata-Query-Service.
When a constraint check runs for a specific Item, we could then delete its existing violations and create new ones in Wikidata-Query-Service.

This would allow queries like:
Give me all

  • mandatory constraint violations
  • for IMDb ID (P345)
  • for actors that live in Germany and are born before 1945

See https://www.mediawiki.org/wiki/Wikibase/Indexing/RDF_Dump_Format#Constraints for the details of the RDF implementation (TL;DR: the wikibase:hasViolationForConstraint predicate).

NOTE: Not all constraint violations are exposed yet. At the moment only a small fraction is available in WDQS. We can improve this further once T189458: re-enable wdqs kafka poller is working.

Demo
All statements with constraint violations:
http://tinyurl.com/yd5t689d

Map/timeline/image grid of items that have a statement with a constraint violation:
http://tinyurl.com/yd62za8q

Bar chart of statements that have a constraint violation,
grouped by the "instance of" value of the respective item:
http://tinyurl.com/ycb8oswo

Event Timeline


> So, if you want to put a dataset into the database, here are the questions to answer:

> 1. What kind of queries are we planning to run? Which use cases would they support? I am not sure I understand the use case for "constraint violations for actors that live in Germany and are born before 1945": what use case would produce such a query?

The use case is similar to the existing maintenance queries: a user wants to keep their domain or project clean and meet the 100% criterion.

> 2. Do we need to have the data in WDQS at all? We have the MWAPI gateway; maybe we could just query a suitable API?

Storing it somewhere else would not let us scale as easily while staying flexible with the queries.

> 3. Is this data set separate from the Wikidata data, or does it need to be in the same namespace (this depends on cross-querying needs)?

Yes, cross querying is needed.

> 4. What is the data model (would be nice to have a wiki page describing it)?

@Lucas_Werkmeister_WMDE, could you please provide a draft?

> 5. How are the data updated: when does an update happen, what triggers it, which data are updated, how soon do we need the updates, etc.? Note that, by design, there is no external push write interface to the database, and adding one would involve significant security hurdles: ensuring that only authorized clients can modify the data, and only the part of the data they are authorized to modify. As Blazegraph does not support users/roles or other access controls, we may have to find some other solution for this.

When a constraint check is executed for an Item the result will be stored and the old result for that Item will be deleted.
Access will only be allowed from within the cluster.

> 6. How would these data be imported/reimported if a node is reimaged? Right now WDQS is designed as secondary data storage, i.e. it does not store any data that lacks a primary source, and it can be cleaned up and restored from external sources.

It will never be imported and it will never be complete.
It is just a snapshot.

> When a constraint check is executed for an Item the result will be stored and the old result for that Item will be deleted.

So, constraint checks are bound to existing items? I wonder if it would be possible to expose them in the same way page properties are exposed.

> It will never be imported and it will never be complete.
> It is just a snapshot.

But a snapshot of what? Right now we can (and do) delete the whole database, reimage any host and restore it from the latest dump + RC feed + category dump. However, none of those contains constraint check data. So how would this be managed?

> Access will only be allowed from within the cluster.

Still, this means we need to create a non-local write interface that previously didn't exist, and put access controls (by IP or otherwise) on it. We will need to research how easy that would be...

> So, constraint checks are bound to existing items? I wonder if it would be possible to expose them in the same way page properties are exposed.

Yes, kinda. The problem is that they are very expensive: executing them can take multiple seconds or even a minute.
That is why we cannot put them in the page properties, AFAIK.
We currently execute them when a logged-in user visits the item page.
This is done by a widget that calls the wbcheckconstraints API.

> But a snapshot of what? Right now we can (and do) delete the whole database, reimage any host and restore it from latest dump + RC feed + category dump. However, neither of those contains constraint checks data. So how this would be managed?

It is ok if they are deleted. They will be repopulated over time when users execute constraint checks.

So from what I understood, you would prefer a pull mechanism similar to recent changes. Is that true?
Pulling should not be a problem. We have an API and we could notify you via the event bus.
It would be important that you only pull cached results, because calculating them is very expensive.
Does that sound good?

> It is ok if they are deleted. They will be repopulated over time when users execute constraint checks.

That would only be true for the entities that have been visited by logged-in users since the deletion. It could take weeks or even months to have the required entities in the cache to be able to find and fix all the constraint violations "for actors that live in Germany and are born before 1945" using the queries. Even if the violations are stale for some time, it would be nice to have a job or something similar that goes through the entities, creating and caching reports for those that don't have one stored, at least in the place WDQS will be pulling that info from.

> That would be only true for the entities that have been visited by logged in users since the deletion. It could take weeks or even months to have the required entities in the cache to be able to find and fix all the constraint violations "for actors that live in Germany and are born before 1945" using the queries. Even if the violations are stale for some time it would be nice having a job or something going through the entities creating and caching reports for the entities that don't have one stored. At least from the place where WDQS will be pulling that info.

Sorry, I know it is not optimal, but would you rather wait for the ideal solution or have querying now?
We can still fix the restore part later, and there are multiple workarounds we could use in the meantime...

The data model would be fairly simple – a single triple, as I already mentioned.

wds:Q42-45E1E647-4941-42E1-8428-A6F6C848276A wikibase:hasViolationForConstraint wds:P463-6F6E17F0-2650-4835-9250-2F893C77B301.

Both the subject and the object are statement nodes. The exact name of the predicate is still up for discussion. To keep the data model simple, I think we can squash constraint violations for the main snak, qualifiers, and references all into the same predicate (which is why I’m proposing “has violation” and not “violates” now), and people can see where exactly the violation is when they visit the entity.
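As a rough illustration of that data model, the following Python sketch builds the single triple from two statement IDs. It assumes that a statement node's `wds:` name is the statement ID with `$` replaced by `-`, as in the example above; the helper names `statement_node` and `violation_triple` are made up for this sketch and are not part of Wikibase.

```python
# Hypothetical helpers, for illustration only.
PREDICATE = "wikibase:hasViolationForConstraint"


def statement_node(statement_id: str) -> str:
    """Turn a statement ID such as 'Q42$45E1...' into a prefixed wds: name.

    Assumption: the '$' separator of the statement ID becomes '-' in RDF.
    """
    return "wds:" + statement_id.replace("$", "-", 1)


def violation_triple(statement_id: str, constraint_id: str) -> str:
    """Build the one triple linking a violating statement to the constraint
    statement (on the property) that it has a violation for."""
    subject = statement_node(statement_id)
    obj = statement_node(constraint_id)
    return f"{subject} {PREDICATE} {obj} ."


print(violation_triple(
    "Q42$45E1E647-4941-42E1-8428-A6F6C848276A",
    "P463$6F6E17F0-2650-4835-9250-2F893C77B301",
))
```

Because violations on the main snak, qualifiers and references are all squashed into the same predicate, nothing in the triple says where in the statement the violation is.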

Example query (the one in the task description: mandatory constraint violations for “IMDb ID” on actors living in Germany born before 1945):

SELECT ?item ?itemLabel ?constraintTypeLabel WHERE {
  wd:P345 p:P2302 ?constraint.
  ?constraint ps:P2302 ?constraintType;
              pq:P2316 wd:Q21502408.
  ?item wdt:P31 wd:Q5;
        wdt:P106 wd:Q33999;
        wdt:P551/wdt:P17 wd:Q183;
        wdt:P569 ?dob.
  FILTER(?dob < "1945-01-01"^^xsd:dateTime)
  ?item p:P345/wikibase:hasViolationForConstraint ?constraint.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
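For anyone who wants to run such a query from a script rather than the query service UI, here is a minimal Python sketch that only builds the request URL. It assumes the public endpoint https://query.wikidata.org/sparql and its `query` and `format` parameters; `build_query_url` is a name invented for this example, and the short query is a simplified stand-in for the one above.

```python
from urllib.parse import urlencode

# Assumption: the public WDQS SPARQL endpoint.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# Simplified stand-in query: any ten statements with a constraint violation.
QUERY = """SELECT ?statement ?constraint WHERE {
  ?statement wikibase:hasViolationForConstraint ?constraint.
} LIMIT 10"""


def build_query_url(sparql: str, endpoint: str = WDQS_ENDPOINT) -> str:
    """Return a GET URL that asks the endpoint to run `sparql` and reply in JSON."""
    return endpoint + "?" + urlencode({"query": sparql, "format": "json"})


url = build_query_url(QUERY)
print(url)
```

The resulting URL can be fetched with any HTTP client. Since only a fraction of the violations is loaded, an empty result does not mean that no violations exist.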

There are some more requested example queries in T172380: Query constraint violations with WDQS (I just made that task a parent task of this one).

After discussion with @Jonas, here's what we can do now without major effort:

  1. We make the wbcheckconstraints API produce an RDF representation (also TBD) of the checks.
  2. We add a parameter to wbcheckconstraints so that it only delivers results if they can be delivered fast (e.g. they are already cached).
  3. When updating an edited item, the WDQS Updater will also call the API above, load the constraints data, and join it with the rest of the data.

This means it will only be updated when the item is edited, and only if the constraint check can prepare the cached data by the time the Updater gets to it. It also has a race condition: one server could hit Wikidata before the constraints are ready and another one after, so the servers would have different data. We will need to see whether this is a real concern in production. But we could at least try this one as a prototype.
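The flow described above can be sketched in Python with an in-memory set standing in for the triple store. Everything here is hypothetical: `update_violations` and `fetch_cached` are illustrative names, and leaving existing triples untouched when nothing is cached is an assumption of this sketch, not a description of what the real Updater does.

```python
from typing import Callable, List, Optional, Set, Tuple

PREDICATE = "wikibase:hasViolationForConstraint"
Triple = Tuple[str, str, str]
# (statement node, constraint node) pairs, or None when nothing is cached.
CachedResult = Optional[List[Tuple[str, str]]]


def update_violations(
    store: Set[Triple],
    item_id: str,
    fetch_cached: Callable[[str], CachedResult],
) -> bool:
    """Replace the item's violation triples with the cached check result.

    Returns False, leaving the store untouched (an assumption of this
    sketch), when the constraint check has no cached result yet.
    """
    result = fetch_cached(item_id)
    if result is None:
        return False
    prefix = f"wds:{item_id}-"
    # Drop the old violation triples for this item's statements only.
    stale = {t for t in store if t[1] == PREDICATE and t[0].startswith(prefix)}
    store.difference_update(stale)
    # Load the fresh ones alongside the rest of the data.
    store.update((stmt, PREDICATE, constraint) for stmt, constraint in result)
    return True
```

The race condition is visible in this shape: two servers calling `update_violations` for the same edit, one before and one after the result is cached, end up with different stores until the item is edited again.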

> This means it will only be updated … if the constraint check can prepare the cached data by the time Updater gets to it.

The updater usually doesn’t take more than a few seconds to reach an item, right? I’m skeptical whether this will be possible…

Also, the updater needs to learn how to remove the old constraints data. (I guess it already knows how to remove other old data from the item, so hopefully that shouldn’t be too difficult.)

> The updater usually doesn’t take more than a few seconds to reach an item, right?

Yes.

> I’m skeptical whether this will be possible…

Then we need a different model.

> Also, the updater needs to learn how to remove the old constraints data.

That's not a problem, we do the same thing for the rest of the RDF data.


Change 434015 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[wikidata/query/rdf@master] [WIP] [DNM] Add loading constraints data

https://gerrit.wikimedia.org/r/434015

Smalyshev moved this task from Doing to Waiting/Blocked on the User-Smalyshev board.

Change 434015 merged by jenkins-bot:
[wikidata/query/rdf@master] Add loading constraints data

https://gerrit.wikimedia.org/r/434015

Change 445454 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[operations/puppet@production] Enable fetching constraints for Updater

https://gerrit.wikimedia.org/r/445454

Change 445454 merged by Gehel:
[operations/puppet@production] Enable fetching constraints for Updater

https://gerrit.wikimedia.org/r/445454

Change 447740 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[operations/puppet@production] Enable constraints fetching for test cluster

https://gerrit.wikimedia.org/r/447740

Change 447741 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[operations/puppet@production] Enable constraints fetching on internal cluster

https://gerrit.wikimedia.org/r/447741

Change 447742 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[operations/puppet@production] Enable constraints loading everywhere

https://gerrit.wikimedia.org/r/447742

Change 447740 merged by Gehel:
[operations/puppet@production] Enable constraints fetching for test cluster

https://gerrit.wikimedia.org/r/447740

Change 447741 merged by Gehel:
[operations/puppet@production] Enable constraints fetching on internal cluster

https://gerrit.wikimedia.org/r/447741

Change 449329 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[mediawiki/extensions/Wikibase@master] Add documentation for wikibase:hasViolationForConstraint

https://gerrit.wikimedia.org/r/449329

Change 447742 merged by Gehel:
[operations/puppet@production] Enable constraints loading everywhere

https://gerrit.wikimedia.org/r/447742

Smalyshev updated the task description.

Awesome! Can we get a more detailed description for the queries in the description so Léa can announce it?

Change 449329 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Add documentation for wikibase:hasViolationForConstraint

https://gerrit.wikimedia.org/r/449329