
Job queue is increasing non-stop
Closed, Resolved · Public

Assigned To
Authored By
Ladsgroup
Aug 21 2017, 12:27 PM
Referenced Files
F10519970: Screen Shot 2017-10-30 at 6.19.11 PM.png
Oct 30 2017, 5:23 PM
F9228251: RefreshLinksJob.svg
Aug 31 2017, 6:26 PM
F9232855: Screen Shot 2017-08-31 at 00.21.36.png
Aug 30 2017, 11:27 PM
F9232784: Screen Shot 2017-08-31 at 00.15.34.png
Aug 30 2017, 11:27 PM
F9232210: Screen Shot 2017-08-31 at 00.00.31.png
Aug 30 2017, 11:01 PM
F9232209: Screen Shot 2017-08-31 at 00.00.18.png
Aug 30 2017, 11:01 PM
F9141847: image.png
Aug 21 2017, 12:27 PM
Tokens
"Burninate" token, awarded by Liuxinyu970226."Burninate" token, awarded by daniel.

Description

This doesn't sound good:

August 21
image.png (910×1 px, 172 KB)

Current: https://grafana.wikimedia.org/dashboard/db/job-queue-health?refresh=1m&orgId=1

Mitigation

  • cb7c910ba Fix old regression in HTMLCacheUpdate de-duplication. @aaron, https://gerrit.wikimedia.org/r/373979

    Following a refactoring in mid-2016, the de-duplication logic for recursive updates was miscalculated: it was based on the current time instead of the timestamp of the initial placeholder/root job.

    Fixing this significantly decreased job queue growth. (A rough sketch of the de-duplication idea follows below.)
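For illustration, here is a minimal Python sketch of the root-job de-duplication idea (the class and field names are simplified stand-ins, not MediaWiki's actual PHP implementation): descendant jobs inherit the signature and timestamp of the root job that spawned them, and a descendant whose root timestamp is older than the newest registered root with the same signature is redundant and can be skipped. If the timestamp is taken from the current time instead, that check never fires and duplicate work piles up.

```python
class JobQueue:
    """Illustrative sketch of root-job de-duplication; not MediaWiki's real API."""

    def __init__(self):
        self.latest_root = {}   # root job signature -> timestamp of the newest root job
        self.jobs = []

    def register_root_job(self, signature, timestamp):
        # Called when a new recursive "root" job (e.g. a purge after a template edit)
        # is enqueued; remembers the newest root timestamp per signature.
        self.latest_root[signature] = max(self.latest_root.get(signature, 0), timestamp)

    def push(self, signature, root_timestamp, pages):
        # Descendant jobs must inherit root_timestamp from the *initial* root job.
        # The regression effectively set this to the current time instead, so
        # is_superseded() below never returned True and duplicates accumulated.
        self.jobs.append({
            "rootJobSignature": signature,
            "rootJobTimestamp": root_timestamp,
            "pages": pages,
        })

    def is_superseded(self, job):
        # Work belonging to an older root than the newest registered root job with
        # the same signature is redundant and can be skipped at execution time.
        return job["rootJobTimestamp"] < self.latest_root.get(job["rootJobSignature"], 0)
```

With the correct inheritance, two identical recursive purges in quick succession let the older one's remaining descendants be skipped instead of executed twice.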
Dashboard: Job Queue Health
Screen Shot 2017-08-31 at 00.00.31.png (624×2 px, 203 KB)
Screen Shot 2017-08-31 at 00.00.18.png (724×1 px, 85 KB)
Job queue size down from 10M to a steady ~2M. Before the regression it held steady between 100K and 1M.
Dashboard: Varnish stats
Screen Shot 2017-08-31 at 00.15.34.png (916×1 px, 204 KB)
Purge rate from production Varnish servers reduced by 2-3X, from 75-100K/s to ~30K/s.
Dashboard: Job Queue Rate for htmlCacheUpdate
Screen Shot 2017-08-31 at 00.21.36.png (670×2 px, 142 KB)
Queue rate of htmlCacheUpdate back to normal. Deduplication/Superseding optimisation is now working. Execution speed has increased.

Patch-For-Review:

See also:

Details

Repo | Branch | Lines +/-
operations/puppet | production | +5 -1
operations/mediawiki-config | master | +1 -1
operations/puppet | production | +8 -4
mediawiki/extensions/Wikibase | master | +60 -47
mediawiki/extensions/Wikibase | master | +102 -42
mediawiki/extensions/Wikibase | master | +360 -68
operations/mediawiki-config | master | +1 -0
operations/mediawiki-config | master | +1 -0
mediawiki/extensions/Wikibase | master | +1 -1
mediawiki/extensions/Wikidata | master | +1 -1
mediawiki/extensions/Wikidata | master | +1 -1
mediawiki/core | wmf/1.30.0-wmf.15 | +7 -3
mediawiki/core | master | +7 -3
mediawiki/extensions/Wikidata | wmf/1.30.0-wmf.15 | +1 -1
mediawiki/extensions/Wikidata | master | +1 -1
mediawiki/core | wmf/1.30.0-wmf.15 | +14 -2
mediawiki/core | master | +14 -2

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden.

Change 376562 had a related patch set uploaded (by Ladsgroup; owner: Amir Sarabadani):
[operations/mediawiki-config@master] Reduce wikiPageUpdaterDbBatchSize to 20

https://gerrit.wikimedia.org/r/376562

This patch can go in once commons is on wmf.17; before that it is useless. (See T174422: Make dbBatchSize in WikiPageUpdater configurable)

Change 377046 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/extensions/Wikibase@master] Allow batch sizes for different jobs to be defined separately.

https://gerrit.wikimedia.org/r/377046

Change 376562 merged by jenkins-bot:
[operations/mediawiki-config@master] Reduce wikiPageUpdaterDbBatchSize to 20

https://gerrit.wikimedia.org/r/376562

Change 377458 had a related patch set uploaded (by Ladsgroup; owner: Amir Sarabadani):
[operations/mediawiki-config@master] Reduce wikiPageUpdaterDbBatchSize to 20

https://gerrit.wikimedia.org/r/377458

Change 377458 merged by jenkins-bot:
[operations/mediawiki-config@master] Reduce wikiPageUpdaterDbBatchSize to 20

https://gerrit.wikimedia.org/r/377458

Mentioned in SAL (#wikimedia-operations) [2017-09-12T13:12:13Z] <hashar@tin> Synchronized wmf-config/Wikibase-production.php: Reduce wikiPageUpdaterDbBatchSize to 20 - T173710 (duration: 00m 45s)

Change 377811 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/extensions/Wikibase@master] Split page set before constructing InjectRCRecordsJob

https://gerrit.wikimedia.org/r/377811

Change 378719 had a related patch set uploaded (by Thiemo Mättig (WMDE); owner: Thiemo Mättig (WMDE)):
[mediawiki/extensions/Wikibase@master] Refactor possibly fragile ChangeHandler/WikiPageUpdater hash calculations

https://gerrit.wikimedia.org/r/378719

Change 375819 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Pass root job params through WikiPageUpdater

https://gerrit.wikimedia.org/r/375819

FWIW we're seeing another almost uncontrollable growth of jobs on commons and probably other wikis. I might decide to raise the concurrency of those jobs.

Change 377046 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Allow batch sizes for different jobs to be defined separately.

https://gerrit.wikimedia.org/r/377046

Change 378719 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Refactor possibly fragile ChangeHandler/WikiPageUpdater hash calculations

https://gerrit.wikimedia.org/r/378719

Today on it.wiki I noticed a massive increase in search results for some queries related to errors that I'm currently trying to fix. This search: https://it.wikipedia.org/w/index.php?search=insource%3A%2F%27%27parlate+prego%27%27%5C%3C%5C%2F%2F&title=Speciale:Ricerca&profile=all&fulltext=1&searchToken=8w6a8h4kmdl0csochsal7380n
now has 6 results, but they were all fixed yesterday. The weird thing is that at 11 AM today the search only returned about 4 results, while the other (already fixed) pages were added around 4 PM. We suppose this is still due to trouble with the job queue; is that right?

> Today on it.wiki I noticed a massive increase in search results for some queries related to errors that I'm currently trying to fix. This search: https://it.wikipedia.org/w/index.php?search=insource%3A%2F%27%27parlate+prego%27%27%5C%3C%5C%2F%2F&title=Speciale:Ricerca&profile=all&fulltext=1&searchToken=8w6a8h4kmdl0csochsal7380n
> now has 6 results, but they were all fixed yesterday. The weird thing is that at 11 AM today the search only returned about 4 results, while the other (already fixed) pages were added around 4 PM. We suppose this is still due to trouble with the job queue; is that right?

Delays in pushing updates into search could potentially be related to the job queue. More than 12 hours is pretty exceptional for processing these, but the refreshLinks job has to run first, and on completion it triggers the search index update jobs. refreshLinks is one of the jobs we've seen backing up from time to time.
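As a toy illustration of that chaining (Python, with made-up function names rather than the real job classes): the search index update is only enqueued once refreshLinks has completed, so a refreshLinks backlog delays search updates too.

```python
from collections import deque

queue = deque()

def reparse_and_update_links(page_id):
    # Stand-in for the expensive parse + links-table update that refreshLinks performs.
    return [f"link-{n}" for n in range(3)]

def run_refresh_links(page_id):
    links = reparse_and_update_links(page_id)
    # Only after the links data is updated does the search index job get enqueued,
    # which is why delays here show up as delayed search results.
    queue.append({"type": "cirrusSearchLinksUpdate", "page": page_id, "links": links})

run_refresh_links(42)
print(queue)
```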

The job queue size just jumped to 12M in two days and it's not going down. I don't know whether it's related to Wikidata or not, but that's something people need to look into.

oblivian@terbium:~$ /usr/local/bin/foreachwikiindblist /srv/mediawiki/dblists/group1.dblist showJobs.php --group | awk '{if ($3 > 10000) print $_}'
cawiki:  refreshLinks: 104355 queued; 3 claimed (3 active, 0 abandoned); 0 delayed
commonswiki:  refreshLinks: 2073193 queued; 44 claimed (21 active, 23 abandoned); 0 delayed
commonswiki:  htmlCacheUpdate: 1583627 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
commonswiki:  cirrusSearchLinksUpdate: 5311248 queued; 2 claimed (2 active, 0 abandoned); 0 delayed
oblivian@terbium:~$ /usr/local/bin/foreachwikiindblist /srv/mediawiki/dblists/group2.dblist showJobs.php --group | awk '{if ($3 > 10000) print $_}'
arwiki:  refreshLinks: 94729 queued; 3 claimed (3 active, 0 abandoned); 0 delayed
cewiki:  refreshLinks: 128373 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
cewiki:  htmlCacheUpdate: 25677 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
enwiki:  refreshLinks: 83152 queued; 2 claimed (2 active, 0 abandoned); 0 delayed
enwiki:  htmlCacheUpdate: 33670 queued; 1 claimed (1 active, 0 abandoned); 0 delayed
frwiki:  refreshLinks: 18401 queued; 2 claimed (2 active, 0 abandoned); 0 delayed
hywiki:  refreshLinks: 91297 queued; 1 claimed (1 active, 0 abandoned); 0 delayed
itwiki:  refreshLinks: 94906 queued; 3 claimed (3 active, 0 abandoned); 0 delayed
ruwiki:  refreshLinks: 1102450 queued; 1 claimed (1 active, 0 abandoned); 0 delayed
ruwiki:  htmlCacheUpdate: 518089 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
svwiki:  refreshLinks: 1083039 queued; 3 claimed (3 active, 0 abandoned); 0 delayed
svwiki:  htmlCacheUpdate: 144734 queued; 1 claimed (1 active, 0 abandoned); 0 delayed
ukwiki:  refreshLinks: 14833 queued; 3 claimed (3 active, 0 abandoned); 0 delayed
zhwiki:  refreshLinks: 23192 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
zhwiki:  htmlCacheUpdate: 19334 queued; 0 claimed (0 active, 0 abandoned); 0 delayed

It is pretty clear to me that one of the reasons was a namespace move we had on commons, but the underlying problem is that the number of refreshLinks and htmlCacheUpdate jobs has spun out of control.

I think we might be able to add some capacity for processing those jobs on Monday, but we probably either have to rethink the approach to the problem or throw more hardware at it.

> I think we might be able to add some capacity for processing those jobs on Monday, but we probably either have to rethink the approach to the problem or throw more hardware at it.

I'm not sure if we need more hardware, or just more effective use of the current hardware. The cirrus jobs in particular are almost entirely bound by network latency and can be run at significantly higher rates than they are now. Over the course of an hour I ramped up the rate at which these jobs are processed (with some bare hhvm processes running runJobs.php on 9 eqiad job runners) to about 200 extra job runner processes. Total job queue throughput increased significantly, from 60k jobs/minute to 100k jobs/minute, and the job runners themselves are still ~40% idle. This is of course hard to generalize to other jobs, as they use remote resources that may or may not have spare capacity. I happen to know what this specific job does and how it should behave, but for a blanket increase in the number of job runners per server across the fleet it is much harder to predict what will happen.
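A rough sketch of what temporarily adding workers like that could look like (Python; the wiki, job type, worker count and script invocation are assumptions to be checked against the deployed MediaWiki/runJobs.php options, not a record of what was actually run):

```python
import subprocess

WIKI = "commonswiki"                  # assumed target wiki
JOB_TYPE = "cirrusSearchLinksUpdate"  # assumed job type
EXTRA_WORKERS = 20                    # far fewer than the ~200 mentioned above

procs = [
    subprocess.Popen([
        "mwscript", "runJobs.php",
        f"--wiki={WIKI}",
        f"--type={JOB_TYPE}",
        "--maxjobs=1000",   # let each worker exit after a bounded amount of work
    ])
    for _ in range(EXTRA_WORKERS)
]

for p in procs:
    p.wait()   # re-spawning finished workers in a loop would keep the extra capacity going
```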

debt subscribed.

Taking off the Discovery-Search tag, as there isn't much we can do, but we'll continue to monitor using the Discovery-ARCHIVED tag.

Updated list (showjob1.txt contains group1, showjob.txt group2)

elukey@terbium:~$ awk '{if ($3 > 100000) print $_}' showjob1.txt
commonswiki:  refreshLinks: 1629991 queued; 5 claimed (2 active, 3 abandoned); 0 delayed
commonswiki:  htmlCacheUpdate: 2470968 queued; 0 claimed (0 active, 0 abandoned); 0 delayed

elukey@terbium:~$ awk '{if ($3 > 100000) print $_}' showjob.txt
arwiki:  refreshLinks: 142867 queued; 3 claimed (3 active, 0 abandoned); 0 delayed
itwiki:  refreshLinks: 198068 queued; 5 claimed (5 active, 0 abandoned); 0 delayed
ruwiki:  refreshLinks: 807144 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
ruwiki:  htmlCacheUpdate: 916539 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
svwiki:  refreshLinks: 1582067 queued; 4 claimed (4 active, 0 abandoned); 0 delayed
svwiki:  htmlCacheUpdate: 186069 queued; 0 claimed (0 active, 0 abandoned); 0 delayed

The job queue is currently at around 8-9M elements and the trend doesn't seem to be improving.

> I think we might be able to add some capacity for processing those jobs on Monday, but we probably either have to rethink the approach to the problem or throw more hardware at it.

Just as an FYI, we added new capacity in T165519 (we now have 19 jobrunners running in eqiad), but in theory we'd eventually need to decommission mw[1161-1167] for T177387 to free some rack space and complete T165519 (other jobrunners will eventually be added as part of that task, but we need to free rack space first).

I think one of the factors contributing to the problem is the same one we had with T171027: "Read timeout is reached" DBQueryError when trying to load specific users' watchlists (with +1000 articles) on several wikis. We stopped emitting injectRCRecord jobs, but we still emit refreshLinks jobs to commonswiki. People are trying to make the whole thing more efficient, but I guess that takes some time. We can spin up more job runners, but that's not my call to make.

I have thought a bit about ways to mitigate this. Here are three things I think could help:

> I think one of the factors contributing to the problem is the same one we had with T171027: "Read timeout is reached" DBQueryError when trying to load specific users' watchlists (with +1000 articles) on several wikis. We stopped emitting injectRCRecord jobs, but we still emit refreshLinks jobs to commonswiki. People are trying to make the whole thing more efficient, but I guess that takes some time. We can spin up more job runners, but that's not my call to make.

Hello, I'm a technician on ruwiki, and our wiki is one of those experiencing the T171027 problem the most. At the same time, I've noticed in the stats presented in the comments above that the numbers for ruwiki are consistently among the highest. Could the two be connected?

Updated status:

elukey@terbium:~$ /usr/local/bin/foreachwikiindblist /srv/mediawiki/dblists/group1.dblist showJobs.php --group | awk '{if ($3 > 10000) print $_}'
cawiki:  refreshLinks: 13566 queued; 6 claimed (6 active, 0 abandoned); 0 delayed
commonswiki:  refreshLinks: 1991671 queued; 1 claimed (1 active, 0 abandoned); 0 delayed
commonswiki:  htmlCacheUpdate: 3760683 queued; 0 claimed (0 active, 0 abandoned); 0 delayed

elukey@terbium:~$ /usr/local/bin/foreachwikiindblist /srv/mediawiki/dblists/group2.dblist showJobs.php --group | awk '{if ($3 > 10000) print $_}'
arwiki:  refreshLinks: 120524 queued; 4 claimed (4 active, 0 abandoned); 0 delayed
bewiki:  refreshLinks: 34551 queued; 3 claimed (3 active, 0 abandoned); 0 delayed
cewiki:  refreshLinks: 142590 queued; 5 claimed (5 active, 0 abandoned); 0 delayed
cewiki:  htmlCacheUpdate: 150593 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
dewiki:  htmlCacheUpdate: 11027 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
enwiki:  refreshLinks: 69933 queued; 4 claimed (4 active, 0 abandoned); 0 delayed
enwiki:  htmlCacheUpdate: 127930 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
frwiki:  refreshLinks: 41595 queued; 5 claimed (5 active, 0 abandoned); 0 delayed
hywiki:  refreshLinks: 95960 queued; 3 claimed (3 active, 0 abandoned); 0 delayed
itwiki:  refreshLinks: 240479 queued; 1 claimed (1 active, 0 abandoned); 0 delayed
itwiki:  htmlCacheUpdate: 70493 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
ruwiki:  refreshLinks: 985639 queued; 2 claimed (2 active, 0 abandoned); 0 delayed
ruwiki:  htmlCacheUpdate: 1928674 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
svwiki:  refreshLinks: 679490 queued; 8 claimed (8 active, 0 abandoned); 0 delayed

We could try to increase the number of runners dedicated to refreshLinks and htmlCacheUpdate and see if we manage to process the backlog?

Change 386636 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::mediawiki::jobrunner: raise temporarily runners for refreshLinks/hmtlCacheUpdate

https://gerrit.wikimedia.org/r/386636

On ruwiki, many editors are complaining that pages are slow to update after template changes. We have a huge job queue, and it keeps growing day by day, even though no heavily used templates/modules have been changed in recent days.

Please tell us: is there any advice that could be given to us, and to other local communities suffering from this?

Change 386636 merged by Elukey:
[operations/puppet@production] role::mediawiki::jobrunner: inc runners for refreshLinks/htmlCacheUpdate

https://gerrit.wikimedia.org/r/386636

Mentioned in SAL (#wikimedia-operations) [2017-10-30T08:42:42Z] <elukey> raised priority of refreshlink and htmlcacheupdate job execution on jobrunners (https://gerrit.wikimedia.org/r/#/c/386636/) - T173710

> On ruwiki, many editors are complaining that pages are slow to update after template changes. We have a huge job queue, and it keeps growing day by day, even though no heavily used templates/modules have been changed in recent days.
>
> Please tell us: is there any advice that could be given to us, and to other local communities suffering from this?

Hi! We are trying to solve the issue from two sides: producing fewer jobs, and giving higher priority to consuming the current backlog (mostly htmlCacheUpdate and refreshLinks jobs). At the moment I don't think there is any good advice for local communities; we are hoping to reduce the backlog soon, but it might take a while :(

Thanks for the reply. It just surprises me that on enwiki the job queue is very lightweight, while on ruwiki it's two thirds of the overall page count, even though enwiki is much more active. Is it because of the wide use of Wikidata on ruwiki?

> Thanks for the reply. It just surprises me that on enwiki the job queue is very lightweight, while on ruwiki it's two thirds of the overall page count, even though enwiki is much more active. Is it because of the wide use of Wikidata on ruwiki?

Yes, that's the reason.

We had some relief after the last change to the jobrunner configs: the queue started shrinking, but then we fell back into the bad pattern where more jobs are constantly enqueued than completed:

Screen Shot 2017-10-30 at 6.19.11 PM.png (335×939 px, 36 KB)

I am currently seeing some big root jobs with timestamps around Oct 27th whose jobs keep being executed, but I have failed to track down what generated them. If anybody has any idea about what procedure to follow to track down the root cause of this job queue increase, please come forward :)

All jobs have a requestId parameter, which is passed down through the execution chain. This is the same as the reqId field in logstash. Basically this means that if the originating request logged anything to logstash, you should be able to find it with the query type:mediawiki reqId:xxxxx and looking for the very first message. That assumes, of course, that the initial request logged anything.
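As a rough sketch of that lookup (the Elasticsearch endpoint and index pattern below are placeholders, not the real logstash hosts):

```python
import json
import urllib.request

ES_URL = "https://logstash.example.org:9200/logstash-*/_search"  # placeholder endpoint
REQ_ID = "xxxxx"                                                 # requestId taken from the job

query = {
    "query": {"query_string": {"query": f'type:mediawiki AND reqId:"{REQ_ID}"'}},
    "sort": [{"@timestamp": {"order": "asc"}}],   # earliest message first
    "size": 1,
}

req = urllib.request.Request(
    ES_URL,
    data=json.dumps(query).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    hits = json.load(resp)["hits"]["hits"]

# The first hit, if any, is the earliest log entry made by the originating request.
print(hits[0]["_source"] if hits else "no log entries for this reqId")
```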

> All jobs have a requestId parameter, which is passed down through the execution chain. This is the same as the reqId field in logstash. Basically this means that if the originating request logged anything to logstash, you should be able to find it with the query type:mediawiki reqId:xxxxx and looking for the very first message. That assumes, of course, that the initial request logged anything.

Thanks! I tried to spot-check in logstash, but I am only able to see the request that starts from the jobrunner (the one executing the job), not much more... :(

https://gerrit.wikimedia.org/r/#/c/385248 looks really promising; I'm not sure when it will be deployed, but it would surely help in quickly finding a massive template change or similar.

It was perhaps noted before, but because of the recursive nature of the refreshLinks and htmlCacheUpdate jobs, even when the backlog is being processed it may not look like it, because the jobs are just enqueuing new jobs. It will probably take some time to really know what effect things are having.
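A conceptual Python sketch of that recursive behaviour (not the real htmlCacheUpdate/refreshLinks code; the batch size is illustrative): the root job over a large page set does little work itself and mostly enqueues batch jobs, so executing jobs can temporarily make the queue longer even though the backlog is being worked through.

```python
from collections import deque

BATCH_SIZE = 300   # illustrative; real batch sizes are configurable

def run_job(job, queue):
    pages = job["pages"]
    if len(pages) > BATCH_SIZE:
        # Recursive step: split the page set and enqueue one job per batch.
        # Running this single job therefore *adds* hundreds of jobs to the queue.
        for i in range(0, len(pages), BATCH_SIZE):
            queue.append({"pages": pages[i:i + BATCH_SIZE]})
    else:
        pass   # Leaf step: actually purge/refresh this batch of pages.

queue = deque([{"pages": list(range(100_000))}])   # one root job covering 100k pages
executed = 0
while queue:
    run_job(queue.popleft(), queue)
    executed += 1
print(executed, "jobs executed for a single root job")
```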

> It was perhaps noted before, but because of the recursive nature of the refreshLinks and htmlCacheUpdate jobs, even when the backlog is being processed it may not look like it, because the jobs are just enqueuing new jobs. It will probably take some time to really know what effect things are having.

We are basically lagging 2-3 days in executing jobs, but the queue keeps growing, and I have no idea whether we are facing something like T129517#2128754 or a 'genuine' (recursive) job enqueue explosion due to a template modification or similar.

https://gerrit.wikimedia.org/r/#/c/385248 should already be working for commons, but in mwlog1001's runJob.log I can only see entries like causeAction=unknown causeAgent=unknown (which probably only confirms that no authenticated user/bot is triggering these jobs repeatedly).

Status:

elukey@terbium:~$ mwscript extensions/WikimediaMaintenance/getJobQueueLengths.php |sort -n -k2 | tail -n 20
euwiki 237
tgwiki 3759
cawiki 4822
enwiktionary 17148
zhwiki 19958
nowiki 21167
wikidatawiki 28257
bewiki 110296
arwiki 132139
ukwiki 132246
dewiki 155322
svwiki 179250
frwiki 214327
hywiki 504377
itwiki 512539
cewiki 593156
enwiki 654998
ruwiki 5274159
commonswiki 8059943

Total 16619065

> https://gerrit.wikimedia.org/r/#/c/385248 should already be working for commons, but in mwlog1001's runJob.log I can only see entries like causeAction=unknown causeAgent=unknown (which probably only confirms that no authenticated user/bot is triggering these jobs repeatedly).

The unknown causes may also stem from the fact that the patch was not active when the initial job was executed, and so its descendants can't know the cause.

Change 388416 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/mediawiki-config@master] Increase concurrency for htmlCacheUpdate

https://gerrit.wikimedia.org/r/388416

Change 388416 merged by Giuseppe Lavagetto:
[operations/mediawiki-config@master] Increase concurrency for htmlCacheUpdate

https://gerrit.wikimedia.org/r/388416

Mentioned in SAL (#wikimedia-operations) [2017-11-03T10:39:07Z] <oblivian@tin> Synchronized wmf-config/CommonSettings.php: Increase concurrency of htmlCacheUpdate jobs T173710 (duration: 00m 48s)

Change 389427 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] jobrunner: make refreshlinks jobs low-priority

https://gerrit.wikimedia.org/r/389427

Change 389427 merged by Giuseppe Lavagetto:
[operations/puppet@production] jobrunner: make refreshlinks jobs low-priority

https://gerrit.wikimedia.org/r/389427

Mentioned in SAL (#wikimedia-operations) [2017-11-06T09:37:49Z] <_joe_> manually running htmlCacheUpdate for commonswiki and ruwiki on terbium, T173710

@Aklapper Probably, but I would close that one, as that should not be happening right now, unless you have reports saying it is again.

I'm working on a proper solution for refreshLinks jobs that are triggered from Wikidata. I've made lots of progress in the past couple of weeks; hopefully this will be resolved by next week.

The job queue size has been reduced to 1.6M and will go down further once we enable "lua fine grained usage tracking" and "statement fine grained usage tracking" on more wikis.

> The job queue size [..] will go down further once we enable "lua fine grained usage tracking" and "statement fine grained usage tracking" on more wikis.

Ref: T184322: Enable fine grained lua tracking gradually in client wikis
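For context, a conceptual sketch of what the fine-grained usage tracking mentioned above buys (illustrative Python; the aspect names and data layout are simplified assumptions, not Wikibase's actual usage-tracking schema): when client pages record which aspect of an entity they use, an entity edit only needs to schedule refreshLinks jobs for the pages that use the changed aspect, instead of for every page that touches the entity.

```python
usage = {
    # page -> set of (entity, aspect) pairs the page actually uses (simplified)
    "Page:A": {("Q64", "label.ru"), ("Q64", "statement.P625")},
    "Page:B": {("Q64", "sitelinks")},
    "Page:C": {("Q64", "label.de")},
}

def pages_to_refresh(entity, changed_aspect):
    # Only pages using the changed aspect need a refreshLinks job; with coarse
    # tracking, every page touching the entity would be rescheduled.
    return [page for page, uses in usage.items() if (entity, changed_aspect) in uses]

print(pages_to_refresh("Q64", "label.ru"))   # -> ['Page:A'] rather than all three pages
```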