
Add Wikidata query service lag to Wikidata maxlag
Closed, Resolved · Public · 8 Estimated Story Points

Description

As a high-volume editor on Wikidata, I want to ensure that my edits do not impair query service responsiveness.

Problem:
maxlag is a parameter that API users can specify to avoid overloading the wiki: if I send an API request with maxlag=5, and the database replicas are currently more than five seconds behind the master, then MediaWiki will immediately refuse the request. Afterwards, I’m supposed to wait for a bit before retrying the request. See https://www.mediawiki.org/wiki/Manual:Maxlag_parameter.
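For illustration, here is a minimal client-side sketch (Python with the requests library; the target item, edit payload, and anonymous token are placeholders, not a recommended edit) of what sending maxlag and detecting the resulting error looks like:

```
import requests

API = "https://www.wikidata.org/w/api.php"   # endpoint used for illustration

resp = requests.post(API, data={
    "action": "wbeditentity",
    "id": "Q4115189",        # Wikidata sandbox item, used here as a placeholder target
    "data": '{"labels": {"en": {"language": "en", "value": "example"}}}',
    "token": "+\\",          # placeholder; a real edit needs a CSRF token from action=query&meta=tokens
    "maxlag": 5,
    "format": "json",
}).json()

if resp.get("error", {}).get("code") == "maxlag":
    # The wiki is too lagged: wait for a bit (the response also carries a
    # Retry-After header) and retry later.
    print("Server lagged, not editing:", resp["error"]["info"])
```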

Last year, we modified the API’s behavior so that this takes into account not just the replication lag, but also the dispatch lag (T194950: Include Wikibase dispatch lag in API "maxlag" enforcing) – if the database replicas are fine, but change dispatching to client wikis is more than 5 minutes behind, then requests with maxlag=5 will still be rejected. (The dispatchLagToMaxLagFactor is configurable, 60 in production, so the threshold for dispatch lag is in minutes instead of seconds.)

However, this does not take the query service lag into account – if updates on some or all of the WDQS servers start to lag behind, edits will continue at full speed as long as database replication and client dispatching are not affected. This can happen because query service lag depends not just on edit rate but also on the size of the entities edited (on each edit, the full entity is reloaded, even if only a small part of it was edited, so editing large items has a disproportionate impact) and the rate of external queries against the server.
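To sketch what the acceptance criterion below would mean in practice, here is a rough, illustrative outline (Python pseudologic, not the actual MediaWiki/Wikibase code; the query service factor name and value are placeholders for what this task still has to decide):

```
# Illustrative pseudologic only, not the actual MediaWiki/Wikibase code.
# The query service factor name and value are placeholders (see the open
# questions below).
DISPATCH_LAG_TO_MAX_LAG_FACTOR = 60          # production value mentioned above
QUERY_SERVICE_LAG_TO_MAX_LAG_FACTOR = 60     # placeholder

def effective_lag(db_replication_lag_s, dispatch_lag_s, query_service_lag_s):
    """Lag value that a request's maxlag parameter would be compared against."""
    return max(
        db_replication_lag_s,
        dispatch_lag_s / DISPATCH_LAG_TO_MAX_LAG_FACTOR,
        query_service_lag_s / QUERY_SERVICE_LAG_TO_MAX_LAG_FACTOR,
    )

# A request with maxlag=5 would then be rejected whenever effective_lag(...) > 5,
# e.g. when the query service is more than 5 * factor seconds behind.
```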

BDD
GIVEN all the WDQS servers are lagged by more than one hour
WHEN I send a wbeditentity API request
AND I set the maxlag parameter to 5 seconds
THEN I should get a maxlag error
AND no edit should be made

(The GIVEN part describes a rather extreme case; once the open questions below are answered, it can perhaps be changed to a more realistic one.)

Acceptance criteria:

  • the effective max lag takes query service update lag into account

Open questions:

  • What should the conversion factor be? (For dispatch lag, it’s 60 – five seconds of replication lag are equivalent to five minutes of dispatch lag.)
  • Lag between different servers can differ significantly. Do we use the mean lag? The median? The maximum? Something else? (For dispatch lag, we seem to use the median; a small illustration of how the options differ follows this list.)
  • maxlag affects all API requests, even ones that shouldn’t have any effect on query service lag, such as action=wbgetentity or action=query&meta=userinfo. Should we try to limit the impact of this change, e. g. by only using query service lag on POST requests? (On the other hand, the same question should apply to dispatch lag and we don’t seem to limit the impact of that as far as I can tell.)
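To make the aggregation question concrete, a purely illustrative comparison (hypothetical per-server lag values, using Python's statistics module) of how the candidates behave when a single server falls badly behind:

```
# Purely illustrative comparison of the aggregation options, using
# hypothetical per-server update lags (in seconds).
import statistics

server_lags = {
    "wdqs-a": 30,
    "wdqs-b": 45,
    "wdqs-c": 7200,   # one badly lagged server
}

lags = list(server_lags.values())
print("mean:  ", statistics.mean(lags))    # pulled up strongly by the outlier
print("median:", statistics.median(lags))  # ignores a single outlier entirely
print("max:   ", max(lags))                # reacts as soon as any one server lags
```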


Details

Repo | Branch | Lines +/-
mediawiki/extensions/Wikidata.org | master | +21 -3
mediawiki/extensions/Wikidata.org | wmf/1.35.0-wmf.28 | +2 -2
operations/mediawiki-config | master | +1 -1
mediawiki/extensions/Wikidata.org | master | +1 -1
mediawiki/extensions/Wikidata.org | wmf/1.35.0-wmf.5 | +1 -1
operations/mediawiki-config | master | +4 -0
operations/puppet | production | +16 -3
mediawiki/extensions/Wikidata.org | wmf/1.35.0-wmf.5 | +47 -0
mediawiki/extensions/Wikidata.org | master | +1 -1
mediawiki/extensions/Wikidata.org | master | +104 -0
mediawiki/extensions/Wikidata.org | wmf/1.35.0-wmf.5 | +104 -0
mediawiki/extensions/Wikidata.org | wmf/1.35.0-wmf.5 | +3 -1
mediawiki/extensions/Wikidata.org | wmf/1.35.0-wmf.5 | +679 -3
mediawiki/extensions/Wikidata.org | master | +47 -0
mediawiki/extensions/Wikidata.org | master | +3 -1
mediawiki/extensions/Wikidata.org | master | +679 -3
mediawiki/extensions/Wikidata.org | master | +32 -0
mediawiki/extensions/Wikibase | master | +854 -0


Event Timeline


Change 551857 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/extensions/Wikidata.org@wmf/1.35.0-wmf.5] Maintenance script to update cached wdqs lag

https://gerrit.wikimedia.org/r/551857

Change 551858 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/extensions/Wikidata.org@wmf/1.35.0-wmf.5] Add ApiMaxLagInfo hook

https://gerrit.wikimedia.org/r/551858

Just prepared the patches needed to backport this (as otherwise it won't go out with the train and land for 2-3 weeks).

Change 551855 merged by jenkins-bot:
[mediawiki/extensions/Wikidata.org@wmf/1.35.0-wmf.5] Add WDQS lag into request maxlag logic.

https://gerrit.wikimedia.org/r/551855

Change 551856 merged by jenkins-bot:
[mediawiki/extensions/Wikidata.org@wmf/1.35.0-wmf.5] Switch from load_composer_autoloader AutoloadNamespaces

https://gerrit.wikimedia.org/r/551856

Mentioned in SAL (#wikimedia-operations) [2019-11-19T17:11:30Z] <addshore@deploy1001> Synchronized php-1.35.0-wmf.5/extensions/Wikidata.org: T221774 - Wikidata.org extension (queryservice maxlag) [[gerrit:551855]] [[gerrit:551856]] (duration: 00m 54s)

Change 551869 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/extensions/Wikidata.org@master] onApiMaxLagInfo ttl to 1

https://gerrit.wikimedia.org/r/551869

Change 551857 merged by jenkins-bot:
[mediawiki/extensions/Wikidata.org@wmf/1.35.0-wmf.5] Maintenance script to update cached wdqs lag

https://gerrit.wikimedia.org/r/551857

Mentioned in SAL (#wikimedia-operations) [2019-11-19T17:43:52Z] <addshore@deploy1001> Synchronized php-1.35.0-wmf.5/extensions/Wikidata.org: T221774 - Wikidata.org extension (queryservice maxlag, maint script) [[gerrit:551857]] (duration: 00m 52s)

Change 551869 merged by jenkins-bot:
[mediawiki/extensions/Wikidata.org@master] onApiMaxLagInfo ttl to 1

https://gerrit.wikimedia.org/r/551869

Change 551858 merged by jenkins-bot:
[mediawiki/extensions/Wikidata.org@wmf/1.35.0-wmf.5] Add ApiMaxLagInfo hook

https://gerrit.wikimedia.org/r/551858

Mentioned in SAL (#wikimedia-operations) [2019-11-19T18:20:28Z] <addshore@deploy1001> Synchronized php-1.35.0-wmf.5/extensions/Wikidata.org: T221774 - Wikidata.org extension (queryservice maxlag, hook) [[gerrit:551858]] (duration: 00m 53s)

Just the cron job to go now
https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/551582/
The default settings mean that 1 hour of query lag = 1 second of maxlag.
I'll look at bringing this setting down once everything (including cron) is actually deployed.

Change 551582 merged by Giuseppe Lavagetto:
[operations/puppet@production] mediawiki/wikidata maint cron for updateQueryServiceLag

https://gerrit.wikimedia.org/r/551582

Change 552069 had a related patch set uploaded (by Addshore; owner: Addshore):
[operations/mediawiki-config@master] wgWikidataOrgQueryServiceMaxLagFactor 180

https://gerrit.wikimedia.org/r/552069

Change 552069 merged by jenkins-bot:
[operations/mediawiki-config@master] wgWikidataOrgQueryServiceMaxLagFactor 180

https://gerrit.wikimedia.org/r/552069

Change 552071 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/extensions/Wikidata.org@master] qsmaxlag, send altered lag info, not only raw lag

https://gerrit.wikimedia.org/r/552071

Change 552072 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/extensions/Wikidata.org@wmf/1.35.0-wmf.5] qsmaxlag, send altered lag info, not only raw lag

https://gerrit.wikimedia.org/r/552072

Change 552072 merged by Addshore:
[mediawiki/extensions/Wikidata.org@wmf/1.35.0-wmf.5] qsmaxlag, send altered lag info, not only raw lag

https://gerrit.wikimedia.org/r/552072

Mentioned in SAL (#wikimedia-operations) [2019-11-20T14:50:53Z] <addshore@deploy1001> Synchronized php-1.35.0-wmf.5/extensions/Wikidata.org: T221774 - Wikidata.org extension (use altered lag, not raw lag) [[gerrit:552072]] (duration: 00m 53s)

Mentioned in SAL (#wikimedia-operations) [2019-11-20T14:52:39Z] <addshore@deploy1001> Synchronized wmf-config/InitialiseSettings.php: T221774 - wgWikidataOrgQueryServiceMaxLagFactor 180 [[gerrit:552069]] (duration: 00m 52s)

Change 552071 merged by jenkins-bot:
[mediawiki/extensions/Wikidata.org@master] qsmaxlag, send altered lag info, not only raw lag

https://gerrit.wikimedia.org/r/552071

Mentioned in SAL (#wikimedia-operations) [2019-11-20T15:36:08Z] <addshore@deploy1001> Synchronized wmf-config/InitialiseSettings.php: RESYNC T221774 - wgWikidataOrgQueryServiceMaxLagFactor 180 [[gerrit:552069]] (duration: 00m 52s)

The first sync didn't seem to do the trick, but the second one did.

The discovery team suggested that perhaps median might not be the best option.
We should investigate using https://config-master.wikimedia.org/pybal/eqiad/wdqs and https://config-master.wikimedia.org/pybal/codfw/wdqs to load the lag state of ONLY pooled servers, and then we could actually use max.

Will probably do this in a separate ticket.
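A rough sketch of that idea (assuming each pybal file lists one host record per line with "host" and "enabled" fields, which should be checked against the real files, and with a hypothetical per-host lag lookup):

```
# Sketch only. Assumes each pybal file lists one record per line with "host"
# and "enabled" fields (check the real files before relying on this), and uses
# a hypothetical per-host lag lookup.
import ast
import json
import requests

POOL_URLS = [
    "https://config-master.wikimedia.org/pybal/eqiad/wdqs",
    "https://config-master.wikimedia.org/pybal/codfw/wdqs",
]

def parse_pool_line(line):
    # Try JSON first, then Python-style literals, since the exact quoting is an assumption.
    try:
        return json.loads(line)
    except ValueError:
        return ast.literal_eval(line)

def pooled_hosts():
    hosts = []
    for url in POOL_URLS:
        for line in requests.get(url).text.splitlines():
            line = line.strip()
            if line:
                record = parse_pool_line(line)
                if record.get("enabled"):
                    hosts.append(record["host"])
    return hosts

def get_lag_seconds(host):
    # Hypothetical lookup of a server's update lag, e.g. from Prometheus.
    raise NotImplementedError

# Worst lag among pooled servers only, as suggested above:
# worst_lag = max(get_lag_seconds(h) for h in pooled_hosts())
```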

Mentioned in SAL (#wikimedia-operations) [2019-11-20T18:42:56Z] <addshore@deploy1001> Synchronized wmf-config/InitialiseSettings.php: T221774 - wgWikidataOrgQueryServiceMaxLagFactor 170 (duration: 00m 53s)

Mentioned in SAL (#wikimedia-operations) [2019-11-20T18:51:14Z] <addshore@deploy1001> Synchronized wmf-config/InitialiseSettings.php: T221774 - wgWikidataOrgQueryServiceMaxLagFactor 120 (duration: 00m 54s)

Mentioned in SAL (#wikimedia-operations) [2019-11-20T18:56:43Z] <addshore@deploy1001> Synchronized wmf-config/InitialiseSettings.php: RESYNC T221774 - wgWikidataOrgQueryServiceMaxLagFactor 120 (duration: 00m 50s)

Currently multiple tools are broken because the time it takes for maxlag to return to normal is much longer than the total time the tools spend retrying edits (see this and this).


Isn't this just the change working as intended? Tools should be updated to handle maxlag gracefully. The retry delay should not be fixed, but should increase each time the error is returned: so 5, 10, 20, 40, 80, etc. This also makes sure that not all tools restart at the same time.
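A minimal sketch of that suggestion (make_edit stands in for whatever API call a given tool performs):

```
# Sketch of the suggested back-off: keep maxlag=5 and double the wait after
# every maxlag error (5, 10, 20, 40, 80, ... seconds). make_edit is a
# placeholder for whatever API call the tool performs.
import time

def edit_with_backoff(make_edit, initial_wait=5, max_wait=600):
    wait = initial_wait
    while True:
        resp = make_edit(maxlag=5)
        if resp.get("error", {}).get("code") != "maxlag":
            return resp
        time.sleep(wait)
        wait = min(wait * 2, max_wait)
```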

Thanks for the notification! I would be happy to release a new version of OpenRefine with a patch applied - I can do this in the coming days. The exponential back-off suggested by @Multichill makes sense intuitively - could WMDE confirm that this is the policy they recommend? Happy to adapt the policy as required.

Notable that WDQS lag seems to be taking not a blind bit of notice of this change :(

WDQS 1005-7 all at ~3 hours.

https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&panelId=8&fullscreen


Due to the apparent differences in lag pattern between some of the nodes we need to try to get T238751 done so that we can use the max lagged server rather than the median (which we currently do).

So this is what we get with an exponential back-off (1.5 factor), at the moment:

22:37:27.148 [..baseapi.WbEditingAction] [maxlag] Waiting for all: 5.4916666666667 seconds lagged. -- pausing for 1000 milliseconds. (19338ms)
22:37:28.729 [..baseapi.WbEditingAction] [maxlag] Waiting for all: 5.4916666666667 seconds lagged. -- pausing for 1500 milliseconds. (1581ms)
22:37:33.809 [..baseapi.WbEditingAction] [maxlag] Waiting for all: 5.4916666666667 seconds lagged. -- pausing for 2250 milliseconds. (5080ms)
22:37:37.931 [..baseapi.WbEditingAction] [maxlag] Waiting for all: 5.4916666666667 seconds lagged. -- pausing for 3375 milliseconds. (4122ms)
22:37:42.663 [..baseapi.WbEditingAction] [maxlag] Waiting for all: 5.4916666666667 seconds lagged. -- pausing for 5062 milliseconds. (4732ms)
22:37:49.437 [..baseapi.WbEditingAction] [maxlag] Waiting for all: 5.4916666666667 seconds lagged. -- pausing for 7593 milliseconds. (6774ms)
22:37:58.429 [..baseapi.WbEditingAction] [maxlag] Waiting for all: 5.4916666666667 seconds lagged. -- pausing for 11389 milliseconds. (8992ms)
22:38:18.217 [..baseapi.WbEditingAction] [maxlag] Waiting for all: 6 seconds lagged. -- pausing for 17083 milliseconds. (19788ms)
22:38:36.461 [..baseapi.WbEditingAction] [maxlag] Waiting for all: 6 seconds lagged. -- pausing for 25624 milliseconds. (18244ms)
22:39:05.013 [..baseapi.WbEditingAction] [maxlag] Waiting for all: 6.4666666666667 seconds lagged. -- pausing for 38436 milliseconds. (28552ms)

So it looks like this means no OpenRefine edits at all with these new rules, in the current situation.

Actually maxlag may be more than 5 seconds for up to an hour. Tools should not break (i.e. not "give up") even if no edits can be made.

@Addshore what factor are we using right now? Most bots seem to have stopped so it might be on the low side.

Change 552474 had a related patch set uploaded (by Addshore; owner: Addshore):
[operations/mediawiki-config@master] wgWikidataOrgQueryServiceMaxLagFactor 60

https://gerrit.wikimedia.org/r/552474

@Addshore what factor are we using right now? Most bots seem to have stopped so it might be on the low side.

I have slowly been lowering the factor, and am just about to set it at 60, so that 1 minute of lag = 1 second of maxlag.
Once we get down this low, we should end up in a position where getting into such a bad state as we have been in over the past days / week is fairly hard, if not impossible.
I would like to think that the spikes of maxlag will be much smaller, but we will see; I'm always open to changing the factor if it doesn't work so well.

image.png (280×913 px, 40 KB)

As can be seen with the maxlag over the past 12 hours, it seems to be working.

This system will be much better when T238751 is also done.

Change 552474 merged by jenkins-bot:
[operations/mediawiki-config@master] wgWikidataOrgQueryServiceMaxLagFactor 60

https://gerrit.wikimedia.org/r/552474

Mentioned in SAL (#wikimedia-operations) [2019-11-22T12:32:27Z] <addshore@deploy1001> Synchronized wmf-config/InitialiseSettings.php: T221774 - wgWikidataOrgQueryServiceMaxLagFactor 60 (duration: 00m 53s)

Mentioned in SAL (#wikimedia-operations) [2019-11-22T12:34:11Z] <addshore@deploy1001> Synchronized wmf-config/InitialiseSettings.php: T221774 - wgWikidataOrgQueryServiceMaxLagFactor 60 RESYNC (duration: 00m 51s)

One problem with the current policy (requesting all automated editing processes to use maxlag=5) is that this creates a binary threshold: either the query service lag is under the threshold, in which case bots will edit at full speed, or the query service lag is above the threshold, in which case they should all stop editing entirely. This is likely to create an oscillating behaviour, where all bots start and stop periodically. This is probably not ideal for either the infrastructure or the users.

Not all automated edits are long-running tasks where stopping for an hour is acceptable: many QuickStatements or OpenRefine batches are under 10 edits for instance. As a user of these tools, I would expect them to slow down when lag is high, but stopping entirely should be extremely rare, as it breaks workflows.

I think it would be preferable to get a continuous throttling behaviour: the higher the lag, the slower the edits.
This can already be achieved by clients by using a small maxlag parameter and increasing it gradually as they retry. For instance, start with maxlag=2, which fails, retry in 1 sec with maxlag=3, which fails, retry in 4 seconds with maxlag=4, which fails, retry in 8 seconds with maxlag=5, which fails, retry in 16 seconds with maxlag=6, which succeeds.

Would such a behaviour be acceptable?
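A minimal sketch of that strategy (again with make_edit standing in for the client's own API call; the starting values are illustrative):

```
# Sketch of the continuous-throttling idea: start with a low maxlag and raise
# it (while also waiting longer) on each maxlag error, so edits slow down
# gradually instead of stopping outright. make_edit is a placeholder for the
# client's own API call; the starting values are illustrative.
import time

def throttled_edit(make_edit, start_maxlag=2, max_maxlag=6):
    wait = 1
    resp = None
    for maxlag in range(start_maxlag, max_maxlag + 1):
        resp = make_edit(maxlag=maxlag)
        if resp.get("error", {}).get("code") != "maxlag":
            return resp
        time.sleep(wait)
        wait *= 2
    return resp  # still lagged even at max_maxlag; the caller decides what to do next
```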

Actually, some tools seem to be doing something like that already, since edits are still going through despite maxlag being above 5 for more than an hour now (Author Disambiguator does this, QuickStatements too probably, Edoderoobot too). So these tools use higher (more aggressive) maxlag values than 5.

I will therefore follow their example and publish a new version of OpenRefine which does that. It might be worth updating the official recommendations about this: if the rule is never to set maxlag higher than 5, then it looks like not many are actually following it…

I think it would be preferable to get a continuous throttling behaviour: the higher the lag, the slower the edits.

Sounds sensible to me; 5 should perhaps still be the point at which to stop.
But if editing slows down while the "maxlag" is between 0 and 5, then maybe we will stop reaching 5 in the first place :)

Actually, some tools seem to be doing something like that already, since edits are still going through despite maxlag being above 5 for more than an hour now (Author Disambiguator does this, QuickStatements too probably, Edoderoobot too). So these tools use higher (more aggressive) maxlag values than 5.

I can probably go ahead and check the values.
It is also possible that some bots and tools do not send maxlag values at all.

The theory behind maxlag is that it reflects how lagged the SQL DBs are.
So, if you have a maxlag of 5, you can wait 5 seconds and in theory that lag might then be gone.
We are now abusing that somewhat by including dispatch lag and query service lag, which don't quite fit that model, which is why conversion factors are applied to them.

OK! If you have ways to check what sort of maxlag values are used, that would be great!

I will try to implement a responsible throttling strategy in OpenRefine, hoping that others will do the same.

I will try to implement a responsible throttling strategy in OpenRefine, hoping that others will do the same.

Once this is done, if by any chance you want to share and explain how you did it, for example in a blog post, so we can share with other tool builders, it would be absolutely wonderful <3

I am first getting in touch with people who seem to be running bots with maxlag greater than 5, or with no maxlag parameter at all, to see whether they would agree to follow @Addshore's advice never to use maxlag greater than 5.

Am I misreading this graph? https://grafana.wikimedia.org/d/000000489/wikidata-query-service?panelId=8&fullscreen&orgId=1&from=now-12h&to=now&refresh=10s It looks like the query service lag for 3 of the servers has been growing steadily for the past roughly 8 hours. However, edits are going through. Did something change in the maxlag logic somewhere earlier today?

maxlag is currently based on median (not max) of lag of all public servers.

@Bugreporter well, something must have changed earlier today - was it previously "mean" and is it now "median"? I'm not sure which is better, but having WDQS hours out of date (we're over 4 hours now) is NOT a good situation, and exactly what this whole task was intended to avoid! @Pintoch any thoughts on this?

Due to the apparent differences in lag pattern between some of the nodes we need to try to get T238751 done so that we can use the max lagged server rather than the median (which we currently do).

I don't know - I stopped working on this task and T240369 since T240374 was declined. I don't think I can contribute to solving this problem in the current state of affairs, sorry!


Excuse me? What does that have to do with it? Don't put this on me. Maxlag is just a configuration setting in Pywikibot and the default is 5; there is no "code to fix". Every once in a while I might set it to a different value while testing so I'm not wasting my time. Generally it's just set to 5.

This does not have anything to do with you indeed! I was just trying to explain that I stopped trying to help solve this issue (therefore unsubscribing from this).

Am I misreading this graph? https://grafana.wikimedia.org/d/000000489/wikidata-query-service?panelId=8&fullscreen&orgId=1&from=now-12h&to=now&refresh=10s It looks like the query service lag for 3 of the servers has been growing steadily for the past roughly 8 hours. However, edits are going through. Did something change in the maxlag logic somewhere earlier today?

Note that this dashboard includes metrics for both pooled and depooled servers.
So whatever you read there will likely also include data for servers that you can't actually query, and thus whose lag you are not seeing via the query service.


However, there was definitely significant lag on some (most?) of the WDQS servers available for querying when I noticed this problem - updates made over an hour earlier were not visible when I queried. But it seems to have resolved for now.

Today maxlag has been 8-10 s for the whole day (WDQS lag 8-10 minutes); none of the Pywikibot tests loading WD succeed within the timeout. See T243701 for more details.


Reference for the people looking here (last 2 days):

image.png (306×1 px, 140 KB)

Anything above the orange line is maxlag greater than 5

Change 589873 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/extensions/Wikidata.org@master] Switch query service maxlag to be median +1

https://gerrit.wikimedia.org/r/589873

Change 589874 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/extensions/Wikidata.org@wmf/1.35.0-wmf.28] Switch query service maxlag to be median +1

https://gerrit.wikimedia.org/r/589874

Change 589874 abandoned by Addshore:
Switch query service maxlag to be median +1

https://gerrit.wikimedia.org/r/589874

Change 589873 merged by jenkins-bot:
[mediawiki/extensions/Wikidata.org@master] Switch query service maxlag to be median +1

https://gerrit.wikimedia.org/r/589873