Wikidata:SPARQL query service/WDQS graph split/WDQS Split Refinement

This page was used to seek feedback on how to improve the rules used to split the WDQS graph.

The feedback period ended on May 15, 2024. Here is a summary of the outcome:

Duplicating properties in both graphs (wd:P*) does not seem necessary and won't be done.
The list of publication types that identify what is a scholarly article has been improved, see the final list of items here.
Whether sitelinks should inform the nature of the split was discussed, but this idea was not incorporated because it might make it harder to understand what is where.
Items that define multiple instance of (P31) values, which might be ambiguous, were discussed and investigated. It appears that this might not affect many items and that the solution might be to disambiguate these instances by creating separate entities (see the Clinical Trials section of the Talk Page).
Rethinking how scholarly articles are modeled was raised, especially identifying the nature of the publication with a separate property rather than with instance of (P31). This idea should probably be explored and discussed by the WikiCite community. It does not affect the nature of the split but could be a useful criterion to take into consideration in the future.

Below is the page as it was written when the feedback period was open.

The first iteration of the graph split announced in Wikidata:SPARQL_query_service/WDQS_graph_split is very naive and consists of a single simple rule: if an entity is an instance of (P31) scholarly article (Q13442814), it is moved to the scholarly subgraph; everything else stays in the main subgraph.

This simple split divides the full graph in half^[1], but questions were raised as to whether better rules could be applied to make the split more usable.

In the scope of this document we will express the rules using triple patterns^[2] and/or simple BGP^[3].

For instance, the rule used to make the initial split can be expressed as: ?entity wdt:P31 wd:Q13442814.

The content of this page is open for feedback on its talk page, please see #Feedback.

Limitations

There are limitations on the rules we can define for the split.

Technical Limitations

The set of rules must be easily implementable in the two contexts where we need them:

offline: by separating a dump into multiple subgraphs, required to generate the initial RDF datasets that will be imported into Blazegraph
online: to identify in real time, after an edit, which subgraph should be updated, required by the WDQS Updater.

The latter is obviously the more limiting one. When splitting offline from a dump, given enough resources and time, we could apply fairly complex rules to separate the graphs. Splitting in real time on a per-entity basis is, on the other hand, much more challenging. For this reason the rules should only require the data available at the entity level (locally scoped^[4]).

To illustrate this limitation, here are some examples of rules that can be implemented:

has a direct statement with the property PXYZ: ?entity wdt:PXYZ []
has a sitelink: [] schema:about ?entity
has a statement with the property PXYZ pointing to the entity QXYZ: ?entity p:PXYZ/ps:PXYZ wd:QXYZ

And below are some examples of rules that cannot be applied in real time:

Accessing the subclass hierarchy of a class: ?entity wdt:P31/wdt:P279* wd:Q13442814
Accessing the data of another linked entity (e.g. some data related to the author of a publication): ?entity wdt:P50/wdt:P4450 []
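
The difference between the two groups of rules can be sketched in Python. This is a minimal illustration, not the WDQS Updater implementation: the `matches` helper and the hand-written triples are hypothetical, and an entity's local data is assumed to be a plain list of prefixed-name triples.

```python
from typing import Iterable, Optional, Tuple

Triple = Tuple[str, str, str]


def matches(entity: str, local_triples: Iterable[Triple],
            predicate: str, obj: Optional[str] = None) -> bool:
    """Check a single triple pattern `entity predicate obj` against local data.

    `obj=None` plays the role of the blank node `[]` (any value).
    """
    return any(
        s == entity and p == predicate and (obj is None or o == obj)
        for s, p, o in local_triples
    )


# Hypothetical excerpt of the triples available for one entity (illustration only).
local = [
    ("wd:Q54017915", "wdt:P31", "wd:Q13442814"),
    ("wd:Q54017915", "wdt:P50", "wd:Q17714"),
]

# Locally scoped rules: answerable from the entity's own triples.
is_scholarly = matches("wd:Q54017915", local, "wdt:P31", "wd:Q13442814")
has_author = matches("wd:Q54017915", local, "wdt:P50")

# Not locally scoped: wdt:P50/wdt:P4450 needs a triple whose subject is the
# author (wd:Q17714), and that triple is not part of this entity's data.
author_rule = matches("wd:Q17714", local, "wdt:P4450")
```

The last check is false here regardless of what the author item actually contains, which is why such rules cannot be evaluated on a per-entity basis.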

In addition to these, the split must comply with the objectives of the Wikidata:SPARQL_query_service/WDQS_graph_split by significantly reducing the size of the individual subgraphs.

Note that the query service can be used to test a rule using the simple query below:

# Simple query to test if an entity bound as ?entity matches a given rule
SELECT (IF(COUNT(*) > 0, true, false) as ?result) {
  # Replace Q77051335 with the entity you want to apply the rule for
  VALUES (?entity) {(wd:Q77051335)} 
  # Paste the rule here:
  ?entity wdt:P31 wd:Q13442814 
}

Usability

The split must minimize the impact on use-cases that do not depend on scientific publications. Additionally, because the data will be served from two different endpoints, it should remain clear which endpoint to use for a particular use-case. Reusers of the data must be able to easily understand which data is in which subgraph.

Suggested improvements

Include more types of publication

Per wikicite.org the target publications are all entities with an instance of (P31) value that is a subclass of publication (Q732577) or article (Q191067). To transform this into a set of rules we have to unfold all these subclasses; unfortunately, many of them do not seem appropriate. Fortunately, some work has already been done to manually pick the ones that could be a good fit^[5].

We could also study, if deemed necessary, whether this list can be expanded by exploring which instance of (P31) values are used on entities that declare an author (author (P50) or author name string (P2093)).
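
Once the subclasses are unfolded, the rule becomes a membership test against a fixed set of types. Below is a minimal Python sketch, assuming the instance of (P31) values of an entity have already been extracted. `SCHOLARLY_TYPES` and `assign_subgraph` are hypothetical names, and only Q13442814 is taken from this page; the curated types are not reproduced here.

```python
from typing import FrozenSet, Iterable

# Only wd:Q13442814 comes from this page. The other curated publication
# types would have to be added from the manually picked list.
SCHOLARLY_TYPES: FrozenSet[str] = frozenset({"wd:Q13442814"})


def assign_subgraph(p31_values: Iterable[str],
                    scholarly_types: FrozenSet[str] = SCHOLARLY_TYPES) -> str:
    """Return "scholarly" if any P31 value is a listed type, else "main"."""
    if any(value in scholarly_types for value in p31_values):
        return "scholarly"
    return "main"


print(assign_subgraph(["wd:Q13442814"]))  # a scholarly article
print(assign_subgraph(["wd:Q5"]))         # a human
```

This corresponds to a rule of the form ?entity wdt:P31 ?type with ?type restricted to a fixed VALUES list, so it remains locally scoped. Sending an entity to the scholarly subgraph as soon as one of several P31 values matches is an assumption of this sketch.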

Add triples to help navigate between the subgraphs

To help navigate between the subgraphs we could add a technical triple indicating that a particular item is served by a different endpoint. For instance, when querying the scholarly subgraph and requesting an entity whose constituents are queryable from another query service endpoint, a new triple can be added to link to that endpoint. It might help users refine their queries and possibly help reduce the number of triples to join through federation, especially on properties that mix entities from both graphs. For instance, main subject (P921) on a publication may reference other publications, which might be in the same graph, or an actual subject present in the main graph.

To encode this triple two namespaces could be added:

http://wikiba.se/queryservice# with the wikibaseqs prefix, used to encode the predicate
https://query.wikidata.org/subgraph/ with the wdsubgraph prefix, used to encode the resource identifier of the graph

For instance an entity like Black hole explosions? (Q54017915) that is part of the scholarly subgraph would have a triple wd:Q54017915 wikibaseqs:subgraph wdsubgraph:scholarly in the main graph. Similarly Stephen Hawking (Q17714) being served from the main graph would have a triple wd:Q17714 wikibaseqs:subgraph wdsubgraph:main in the scholarly subgraph.

The number of added triples should be around 110M (spread across both splits), which might be acceptable^[6] if proven useful.
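
As a rough illustration of the proposal, the Python sketch below produces the marker triple for an entity and names the subgraph that would store it. It assumes the two proposed namespaces and the standard Wikidata entity namespace; `marker_triple` is a hypothetical helper, not part of any existing tool.

```python
from typing import Tuple

WD = "http://www.wikidata.org/entity/"
WIKIBASEQS = "http://wikiba.se/queryservice#"        # proposed predicate namespace
WDSUBGRAPH = "https://query.wikidata.org/subgraph/"  # proposed graph identifiers


def marker_triple(qid: str, home: str) -> Tuple[str, str]:
    """Return (subgraph storing the marker, marker as an N-Triples line).

    `home` is the subgraph that actually serves the entity; the marker is
    stored in the other subgraph and points back to `home`.
    """
    if home not in ("main", "scholarly"):
        raise ValueError(f"unknown subgraph: {home}")
    stored_in = "main" if home == "scholarly" else "scholarly"
    line = f"<{WD}{qid}> <{WIKIBASEQS}subgraph> <{WDSUBGRAPH}{home}> ."
    return stored_in, line


print(marker_triple("Q54017915", "scholarly"))  # marker stored in the main graph
print(marker_triple("Q17714", "main"))          # marker stored in the scholarly graph
```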

Duplicate properties in both graphs (wd:P*)

The query service exposes the RDF representation of properties. These can be used in some cases to navigate the graph without precisely naming the properties themselves, for instance to list all the direct claims of a given item that link to another Wikidata item:

SELECT ?item ?property ?value {
  VALUES (?item) {(wd:Q42)}
  ?item ?wdt ?value .
  ?property a wikibase:Property;
        wikibase:propertyType wikibase:WikibaseItem;
        wikibase:directClaim ?wdt.
}

Such a query would require federation if applied to the scholarly subgraph. Duplicating the property definitions in both subgraphs could allow running this query as-is. The number of added triples should be marginal.
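
To give an idea of what the federated variant could look like, the Python sketch below rebuilds the query above with the property lookup wrapped in a SPARQL SERVICE clause. The function name is hypothetical and the endpoint URL is a placeholder, since this page does not name the endpoints.

```python
def federated_property_query(item: str, property_endpoint: str) -> str:
    """Build the query with the property definitions fetched through SERVICE."""
    return f"""SELECT ?item ?property ?value {{
  VALUES (?item) {{({item})}}
  ?item ?wdt ?value .
  SERVICE <{property_endpoint}> {{
    ?property a wikibase:Property;
          wikibase:propertyType wikibase:WikibaseItem;
          wikibase:directClaim ?wdt.
  }}
}}"""


# Placeholder URL: replace it with the endpoint serving the property definitions.
query = federated_property_query("wd:Q54017915", "https://example.org/main/sparql")
print(query)
```

With the property definitions duplicated in both subgraphs, the SERVICE wrapper would not be needed.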

Model publications differently

Another approach that was discussed to help separate the two subgraphs is rethinking the way publications are modeled. Today publications are specialized through instance of (P31), using the wide variety of subclasses available. Another strategy could be inspired by how human (Q5) items are modeled: a single P31 value, with other properties such as occupation (P106) specializing the other aspects.

Sadly, given the time frame of the project this strategy might not be applicable in this context.

Feedback

Please use the discussion page of this page to give us feedback on these improvements, for instance:

if you have comments/concerns on the set of new types of publication expressed in this spreadsheet
to let us know if one of these ideas is useful or useless based on your past experience using the query service

But please also feel free to suggest new ones, as long as they comply with the limitations expressed above. Per the April update, the feedback period will end on May 15, 2024.

Please note that feedback not precisely related to the refinement of the splitting strategy should be given on the talk page of Wikidata:SPARQL_query_service/WDQS_graph_split directly.

Notes

[1] One of the main objectives of this project is to drastically reduce the graph size hosted by a single WDQS node.
[2] https://www.w3.org/2001/sw/DataAccess/rq23/#BasicGraphPattern
[3] https://www.w3.org/2001/sw/DataAccess/rq23/#BasicGraphPatternMatching
[4] These are all the triples available when dumping an entity using Special:EntityData, e.g. Q42.ttl.
[5] Please see this spreadsheet for more details.
[6] The main drawback here is that there would be no way during the update process to filter the entities that are actually linked from one subgraph to another, so all entities not part of the current subgraph would have to have this triple added.
