T173710 Job queue is increasing non-stop

Dashboard: Job Queue Health

Job queue size is down from 10M to a steady ~2M. Before the regression it held steady between 100K and 1M.

Dashboard: Varnish stats

Purge rate from production Varnish servers reduced by 2-3X, from 75-100K/s to ~30K/s.

Dashboard: Job Queue Rate for htmlCacheUpdate

Queue rate of `htmlCacheUpdate` back to normal. Deduplication/Superseding optimisation is now working. Execution speed has increased.

Subject	Repo	Branch	Lines +/-
jobrunner: make refreshlinks jobs low-priority	operations/puppet	production	+5 -1
Increase concurrency for htmlCacheUpdate	operations/mediawiki-config	master	+1 -1
role::mediawiki::jobrunner: inc runners for refreshLinks/htmlCacheUpdate	operations/puppet	production	+8 -4
Refactor possibly fragile ChangeHandler/WikiPageUpdater hash calculations	mediawiki/extensions/Wikibase	master	+60 -47
Allow batch sizes for different jobs to be defined separately.	mediawiki/extensions/Wikibase	master	+102 -42
Pass root job params through WikiPageUpdater	mediawiki/extensions/Wikibase	master	+360 -68
Reduce wikiPageUpdaterDbBatchSize to 20	operations/mediawiki-config	master	+1 -0
Reduce wikiPageUpdaterDbBatchSize to 20	operations/mediawiki-config	master	+1 -0
Decrease dbBatchSize in WikiPageUpdater	mediawiki/extensions/Wikibase	master	+1 -1
Hotfix: Reduce batch size in WikiPageUpdater	mediawiki/extensions/Wikidata	master	+1 -1
Hotfix: Reduce batch size in WikiPageUpdater	mediawiki/extensions/Wikidata	master	+1 -1
Disable rebound CDN purges for backlinks in HTMLCacheUpdateJob	mediawiki/core	wmf/1.30.0-wmf.15	+7 -3
Disable rebound CDN purges for backlinks in HTMLCacheUpdateJob	mediawiki/core	master	+7 -3
Hotfix: Reduce batch size in WikiPageUpdater	mediawiki/extensions/Wikidata	wmf/1.30.0-wmf.15	+1 -1
Hotfix: Reduce batch size in WikiPageUpdater	mediawiki/extensions/Wikidata	master	+1 -1
Make workItemCount() smarter for htmlCacheUpdate/refreshLinks	mediawiki/core	wmf/1.30.0-wmf.15	+14 -2
Make workItemCount() smarter for htmlCacheUpdate/refreshLinks	mediawiki/core	master	+14 -2

Status	Assigned	Task
Resolved	aaron	T175897 Audit and improve JobQueue stability and performance (2017)
Resolved	Ladsgroup	T173710 Job queue is increasing non-stop
Resolved	daniel	T174422 Make dbBatchSize in WikiPageUpdater configurable
Open	None	T178804 When processing changes to Wikibase SiteLinks on the client, only trigger updates for sitelinks that are actually shown in the sidebar.
Resolved	Addshore	T178806 Wikibase: Batch HTMLCacheUpdateJobs across changes
Declined	None	T178810 Wikibase: Increase batch size for HTMLCacheUpdateJobs triggered by repo changes.

Change 376562 had a related patch set uploaded (by Ladsgroup; owner: Amir Sarabadani):
[operations/mediawiki-config@master] Reduce wikiPageUpdaterDbBatchSize to 20

https://gerrit.wikimedia.org/r/376562

This patch can go in once commons is on wmf.17; before that it has no effect. (See T174422: Make dbBatchSize in WikiPageUpdater configurable)

mobrovac mentioned this in T174993: Vandalism in "In the news" articles persisting in the app ?. Sep 8 2017, 1:02 PM

Change 377046 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/extensions/Wikibase@master] Allow batch sizes for different jobs to be defined separately.

https://gerrit.wikimedia.org/r/377046

Change 376562 merged by jenkins-bot:
[operations/mediawiki-config@master] Reduce wikiPageUpdaterDbBatchSize to 20

https://gerrit.wikimedia.org/r/376562

Change 377458 had a related patch set uploaded (by Ladsgroup; owner: Amir Sarabadani):
[operations/mediawiki-config@master] Reduce wikiPageUpdaterDbBatchSize to 20

https://gerrit.wikimedia.org/r/377458

Change 377458 merged by jenkins-bot:
[operations/mediawiki-config@master] Reduce wikiPageUpdaterDbBatchSize to 20

https://gerrit.wikimedia.org/r/377458

Mentioned in SAL (#wikimedia-operations) [2017-09-12T13:12:13Z] <hashar@tin> Synchronized wmf-config/Wikibase-production.php: Reduce wikiPageUpdaterDbBatchSize to 20 - T173710 (duration: 00m 45s)

Change 377811 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/extensions/Wikibase@master] Split page set before constructing InjectRCRecordsJob

https://gerrit.wikimedia.org/r/377811

Harej unsubscribed. Sep 13 2017, 6:17 PM

Tbayer subscribed. Sep 13 2017, 9:09 PM

Krinkle added a parent task: T175897: Audit and improve JobQueue stability and performance (2017). Sep 14 2017, 8:50 AM

Change 378719 had a related patch set uploaded (by Thiemo Mättig (WMDE); owner: Thiemo Mättig (WMDE)):
[mediawiki/extensions/Wikibase@master] Refactor possibly fragile ChangeHandler/WikiPageUpdater hash calculations

https://gerrit.wikimedia.org/r/378719

Change 375819 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Pass root job params through WikiPageUpdater

https://gerrit.wikimedia.org/r/375819

FWIW we're seeing another almost uncontrollable growth of jobs on commons and probably other wikis. I might decide to raise the concurrency of those jobs.

YOUR1 subscribed. Sep 20 2017, 7:18 AM

thiemowmde closed subtask T174422: Make dbBatchSize in WikiPageUpdater configurable as Resolved. Sep 20 2017, 10:19 AM

daniel reopened subtask T174422: Make dbBatchSize in WikiPageUpdater configurable as Open. Sep 20 2017, 11:30 AM

Change 377046 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Allow batch sizes for different jobs to be defined separately.

https://gerrit.wikimedia.org/r/377046

daniel closed subtask T174422: Make dbBatchSize in WikiPageUpdater configurable as Resolved. Sep 20 2017, 2:29 PM

Change 378719 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Refactor possibly fragile ChangeHandler/WikiPageUpdater hash calculations

https://gerrit.wikimedia.org/r/378719

Today on it.wiki I noticed a massive increase in search results for some queries related to errors that I'm currently trying to fix. This search: https://it.wikipedia.org/w/index.php?search=insource%3A%2F%27%27parlate+prego%27%27%5C%3C%5C%2F%2F&title=Speciale:Ricerca&profile=all&fulltext=1&searchToken=8w6a8h4kmdl0csochsal7380n
now has 6 results, but they're all fixed since yesterday. The weird thing is, today at 11AM the search only returned something like 4 results, while the other (already fixed) pages were added at around 4PM. We suppose that this is still due to troubles with job queue, is that right?

In T173710#3625333, @Daimona wrote:

Today on it.wiki I noticed a massive increase in search results for some queries related to errors that I'm currently trying to fix. This search: https://it.wikipedia.org/w/index.php?search=insource%3A%2F%27%27parlate+prego%27%27%5C%3C%5C%2F%2F&title=Speciale:Ricerca&profile=all&fulltext=1&searchToken=8w6a8h4kmdl0csochsal7380n
now has 6 results, but they're all fixed since yesterday. The weird thing is, today at 11AM the search only returned something like 4 results, while the other (already fixed) pages were added at around 4PM. We suppose that this is still due to troubles with job queue, is that right?

Delays in pushing updates into search could well be related to the job queue. More than 12 hours is pretty exceptional for processing these, but the refreshLinks job has to run first, and its completion triggers the search index update jobs. refreshLinks is one of the jobs we've seen back up from time to time.

The job queue size just jumped to 12M in two days and it's not going down. I don't know whether it's related to Wikidata, but it's something people need to look into.

oblivian@terbium:~$ /usr/local/bin/foreachwikiindblist /srv/mediawiki/dblists/group1.dblist showJobs.php --group | awk '{if ($3 > 10000) print $_}'
cawiki:  refreshLinks: 104355 queued; 3 claimed (3 active, 0 abandoned); 0 delayed
commonswiki:  refreshLinks: 2073193 queued; 44 claimed (21 active, 23 abandoned); 0 delayed
commonswiki:  htmlCacheUpdate: 1583627 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
commonswiki:  cirrusSearchLinksUpdate: 5311248 queued; 2 claimed (2 active, 0 abandoned); 0 delayed
oblivian@terbium:~$ /usr/local/bin/foreachwikiindblist /srv/mediawiki/dblists/group2.dblist showJobs.php --group | awk '{if ($3 > 10000) print $_}'
arwiki:  refreshLinks: 94729 queued; 3 claimed (3 active, 0 abandoned); 0 delayed
cewiki:  refreshLinks: 128373 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
cewiki:  htmlCacheUpdate: 25677 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
enwiki:  refreshLinks: 83152 queued; 2 claimed (2 active, 0 abandoned); 0 delayed
enwiki:  htmlCacheUpdate: 33670 queued; 1 claimed (1 active, 0 abandoned); 0 delayed
frwiki:  refreshLinks: 18401 queued; 2 claimed (2 active, 0 abandoned); 0 delayed
hywiki:  refreshLinks: 91297 queued; 1 claimed (1 active, 0 abandoned); 0 delayed
itwiki:  refreshLinks: 94906 queued; 3 claimed (3 active, 0 abandoned); 0 delayed
ruwiki:  refreshLinks: 1102450 queued; 1 claimed (1 active, 0 abandoned); 0 delayed
ruwiki:  htmlCacheUpdate: 518089 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
svwiki:  refreshLinks: 1083039 queued; 3 claimed (3 active, 0 abandoned); 0 delayed
svwiki:  htmlCacheUpdate: 144734 queued; 1 claimed (1 active, 0 abandoned); 0 delayed
ukwiki:  refreshLinks: 14833 queued; 3 claimed (3 active, 0 abandoned); 0 delayed
zhwiki:  refreshLinks: 23192 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
zhwiki:  htmlCacheUpdate: 19334 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
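The awk filter in the transcript above keeps the lines whose third whitespace-separated field (the queued count) exceeds 10000; `$_` most likely works only because `_` is an unset awk variable that evaluates to 0, so it resolves to `$0`, the whole line. As a rough sketch of the same filter with the fields named, the Python below parses lines in the format shown in this task. The regular expression and the function names (`parse_line`, `large_queues`) are illustrative assumptions, not part of showJobs.php or any MediaWiki tooling.

```python
import re

# Matches one line of the output format seen in this task, e.g.
# "commonswiki:  refreshLinks: 2073193 queued; 44 claimed (21 active, 23 abandoned); 0 delayed"
LINE_RE = re.compile(
    r"^(?P<wiki>\w+):\s+(?P<job>\w+):\s+(?P<queued>\d+) queued; "
    r"(?P<claimed>\d+) claimed \((?P<active>\d+) active, "
    r"(?P<abandoned>\d+) abandoned\); (?P<delayed>\d+) delayed$"
)

COUNT_FIELDS = ("queued", "claimed", "active", "abandoned", "delayed")


def parse_line(line):
    """Return a dict for one output line, or None if the line does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    for field in COUNT_FIELDS:
        record[field] = int(record[field])
    return record


def large_queues(lines, threshold=10000):
    """Keep records whose queued count is above the threshold (like `$3 > 10000`)."""
    records = (parse_line(line) for line in lines)
    return [r for r in records if r is not None and r["queued"] > threshold]


# A few lines copied from the transcript above.
SAMPLE = [
    "cawiki:  refreshLinks: 104355 queued; 3 claimed (3 active, 0 abandoned); 0 delayed",
    "commonswiki:  refreshLinks: 2073193 queued; 44 claimed (21 active, 23 abandoned); 0 delayed",
    "commonswiki:  htmlCacheUpdate: 1583627 queued; 0 claimed (0 active, 0 abandoned); 0 delayed",
    "frwiki:  refreshLinks: 18401 queued; 2 claimed (2 active, 0 abandoned); 0 delayed",
]

if __name__ == "__main__":
    for rec in large_queues(SAMPLE, threshold=1_000_000):
        print(rec["wiki"], rec["job"], rec["queued"], "abandoned:", rec["abandoned"])
```

Naming the fields also makes it easy to filter on something other than the queued count, for example the abandoned claims that show up for commonswiki.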

It is pretty clear to me that one of the reasons was a namespace move we had on commons, but the underlying problem is that the number of refreshLinks and htmlCacheUpdate jobs has spun out of control.

I think we might be able to add some capacity for processing those jobs on Monday, but we probably have to either rethink the approach to the problem or throw more hardware at it.

I think we might be able to add some capacity for processing those jobs on Monday, but we probably have to either rethink the approach to the problem or throw more hardware at it.

I'm not sure if we need more hardware, or just more effective use of the current hardware. The cirrus jobs in particular are almost entirely bound by network latency, and can be run at significantly higher rates than they are now. Over the course of an hour I ramped up the speed at which these jobs were processing (with some bare hhvm processes on 9 eqiad job runners using runJobs.php) to about 200 extra job runners. Total job queue throughput has increased significantly, from 60k jobs/minute to 100k jobs/minute, and the job runners themselves are still ~40% idle. This is of course hard to generalize to other jobs, as they use remote resources that may or may not be available. I happen to know what this specific job does and how it should behave, but for a general increase in the number of job runners per server across the fleet it is harder to predict what will happen.
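To put the throughput figures above in perspective, here is a back-of-the-envelope sketch. It assumes constant rates, which the real queue certainly does not have, and the enqueue rate is a purely hypothetical parameter: the thread gives the processing rates (60k and 100k jobs/minute) and the 12M backlog, but never the rate at which new jobs arrive.

```python
def relative_increase(before, after):
    """Fractional change going from `before` to `after`."""
    return (after - before) / before


def minutes_to_drain(backlog, process_rate, enqueue_rate=0.0):
    """Minutes needed to empty `backlog` at constant rates (jobs per minute).

    Returns None when jobs arrive at least as fast as they are processed,
    because the backlog then never shrinks.
    """
    net_rate = process_rate - enqueue_rate
    if net_rate <= 0:
        return None
    return backlog / net_rate


if __name__ == "__main__":
    print(f"throughput change: {relative_increase(60_000, 100_000):.0%}")
    # 12M is the backlog reported earlier in the thread; no new jobs assumed.
    print("drain at 60k/min:", minutes_to_drain(12_000_000, 60_000), "minutes")
    print("drain at 100k/min:", minutes_to_drain(12_000_000, 100_000), "minutes")
    # Hypothetical enqueue rate equal to the processing rate: never drains.
    print("with matching enqueue rate:", minutes_to_drain(12_000_000, 100_000, 100_000))
```

With no new jobs arriving, a 12M backlog would clear in a few hours at either rate. That the queue stayed large for weeks suggests the net rate (processing minus enqueueing) was close to zero or negative, and the net rate is what any capacity increase has to change.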

Tbayer mentioned this in T171881: CL support for Wikipedia Zero piracy problems. Oct 5 2017, 1:45 AM

mxn subscribed. Oct 14 2017, 7:31 PM

Taking off the Discovery-Search tag, as there isn't much we can do, but we'll continue to monitor using the Discovery-ARCHIVED tag.

Updated list (showjob1.txt contains group1, showjob.txt group2)

elukey@terbium:~$ awk '{if ($3 > 100000) print $_}' showjob1.txt
commonswiki:  refreshLinks: 1629991 queued; 5 claimed (2 active, 3 abandoned); 0 delayed
commonswiki:  htmlCacheUpdate: 2470968 queued; 0 claimed (0 active, 0 abandoned); 0 delayed

elukey@terbium:~$ awk '{if ($3 > 100000) print $_}' showjob.txt
arwiki:  refreshLinks: 142867 queued; 3 claimed (3 active, 0 abandoned); 0 delayed
itwiki:  refreshLinks: 198068 queued; 5 claimed (5 active, 0 abandoned); 0 delayed
ruwiki:  refreshLinks: 807144 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
ruwiki:  htmlCacheUpdate: 916539 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
svwiki:  refreshLinks: 1582067 queued; 4 claimed (4 active, 0 abandoned); 0 delayed
svwiki:  htmlCacheUpdate: 186069 queued; 0 claimed (0 active, 0 abandoned); 0 delayed

The job queue is currently at around 8-9M jobs and the trend doesn't seem to be improving.

In T173710#3646384, @Joe wrote:

I think we might be able to add some capacity for processing those jobs on Monday, but we probably have to either rethink the approach to the problem or throw more hardware at it.

Just as FYI we added new capacity in T165519 (now we have 19 jobrunners running in eqiad), but in theory we'd need to eventually decom mw[1161-1167] for T177387 to free some rack space and complete T165519 (other jobrunners will eventually be added as part of the task but we need to free rack space first).

I think one of the reasons contributing to the problem is the same one we had with T171027: "Read timeout is reached" DBQueryError when trying to load specific users' watchlists (with +1000 articles) on several wikis. We stopped emitting injectRCRecord jobs, but we are still emitting refreshLinks jobs to commonswiki. People are trying to make the whole thing more efficient, but I guess it takes some time. We can spin up more job runners, but that's not my call to make.

daniel mentioned this in T178810: Wikibase: Increase batch size for HTMLCacheUpdateJobs triggered by repo changes. Oct 23 2017, 3:30 PM

I have thought a bit about ways to mitigate this. Here are three things I think could help:

T178804: When processing changes to Wikibase SiteLinks on the client, only trigger updates for sitelinks that are actually shown in the sidebar. (doable in a week or two, if prioritized)
T178806: Wikibase: Batch HTMLCacheUpdateJobs across changes (doable in a week or two, if prioritized)
T178810: Wikibase: Increase batch size for HTMLCacheUpdateJobs triggered by repo changes. (doable now, it's a one line config change)
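The last two ideas can be illustrated with a toy model. The sketch below is not Wikibase code: the function names and the representation of a change as a plain list of affected page IDs are assumptions made for illustration. It only shows why merging the page lists of several changes before batching, and using a larger batch size, both reduce the number of jobs enqueued.

```python
from itertools import islice


def chunked(items, size):
    """Yield consecutive lists of at most `size` items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def jobs_per_change(changes, batch_size):
    """Batch each change separately; pages hit by several changes get several jobs."""
    return [batch for pages in changes for batch in chunked(pages, batch_size)]


def jobs_across_changes(changes, batch_size):
    """Merge all changes first (dropping duplicate pages), then batch once."""
    unique_pages = dict.fromkeys(page for pages in changes for page in pages)
    return list(chunked(unique_pages, batch_size))


if __name__ == "__main__":
    # Hypothetical data: three repo changes that affect overlapping client pages.
    changes = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
    print(len(jobs_per_change(changes, batch_size=2)), "jobs when batching per change")
    print(len(jobs_across_changes(changes, batch_size=2)), "jobs when batching across changes")
    print(len(jobs_across_changes(changes, batch_size=5)), "job with a larger batch size")
```

The trade-off is that larger batches make each job run longer and touch more rows at once, which is presumably why the batch size was reduced to 20 earlier in this task.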

Jack_who_built_the_house subscribed. Oct 24 2017, 12:35 AM

In T173710#3701806, @Ladsgroup wrote:

I think one of the reasons contributing to the problem is the same one we had with T171027: "Read timeout is reached" DBQueryError when trying to load specific users' watchlists (with +1000 articles) on several wikis. We stopped emitting injectRCRecord jobs, but we are still emitting refreshLinks jobs to commonswiki. People are trying to make the whole thing more efficient, but I guess it takes some time. We can spin up more job runners, but that's not my call to make.

Hello, I'm a technician at ruwiki, and our wiki is one of those that suffered most from the T171027 problem. At the same time, I've noticed in the stats in the comments above that the numbers for ruwiki are consistently among the highest. Could the two be connected?

elukey added a project: User-Elukey. Oct 24 2017, 9:40 AM

Aklapper mentioned this in T178840: Commons category updates are very slow. Commons job queue very large. Oct 24 2017, 12:14 PM

RP88 subscribed. Oct 24 2017, 2:06 PM

zhuyifei1999 subscribed. Oct 24 2017, 4:00 PM

Updated status:

elukey@terbium:~$ /usr/local/bin/foreachwikiindblist /srv/mediawiki/dblists/group1.dblist showJobs.php --group | awk '{if ($3 > 10000) print $_}'
cawiki:  refreshLinks: 13566 queued; 6 claimed (6 active, 0 abandoned); 0 delayed
commonswiki:  refreshLinks: 1991671 queued; 1 claimed (1 active, 0 abandoned); 0 delayed
commonswiki:  htmlCacheUpdate: 3760683 queued; 0 claimed (0 active, 0 abandoned); 0 delayed

elukey@terbium:~$ /usr/local/bin/foreachwikiindblist /srv/mediawiki/dblists/group2.dblist showJobs.php --group | awk '{if ($3 > 10000) print $_}'
arwiki:  refreshLinks: 120524 queued; 4 claimed (4 active, 0 abandoned); 0 delayed
bewiki:  refreshLinks: 34551 queued; 3 claimed (3 active, 0 abandoned); 0 delayed
cewiki:  refreshLinks: 142590 queued; 5 claimed (5 active, 0 abandoned); 0 delayed
cewiki:  htmlCacheUpdate: 150593 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
dewiki:  htmlCacheUpdate: 11027 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
enwiki:  refreshLinks: 69933 queued; 4 claimed (4 active, 0 abandoned); 0 delayed
enwiki:  htmlCacheUpdate: 127930 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
frwiki:  refreshLinks: 41595 queued; 5 claimed (5 active, 0 abandoned); 0 delayed
hywiki:  refreshLinks: 95960 queued; 3 claimed (3 active, 0 abandoned); 0 delayed
itwiki:  refreshLinks: 240479 queued; 1 claimed (1 active, 0 abandoned); 0 delayed
itwiki:  htmlCacheUpdate: 70493 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
ruwiki:  refreshLinks: 985639 queued; 2 claimed (2 active, 0 abandoned); 0 delayed
ruwiki:  htmlCacheUpdate: 1928674 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
svwiki:  refreshLinks: 679490 queued; 8 claimed (8 active, 0 abandoned); 0 delayed

We could try increasing the number of runners dedicated to refreshLinks and htmlCacheUpdate and see whether we manage to process the backlog?

Change 386636 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::mediawiki::jobrunner: raise temporarily runners for refreshLinks/hmtlCacheUpdate

https://gerrit.wikimedia.org/r/386636

thiemowmde added subtasks: T178804: When processing changes to Wikibase SiteLinks on the client, only trigger updates for sitelinks that are actually shown in the sidebar., T178806: Wikibase: Batch HTMLCacheUpdateJobs across changes, T178810: Wikibase: Increase batch size for HTMLCacheUpdateJobs triggered by repo changes. Oct 27 2017, 12:35 PM

thiemowmde updated the task description. Oct 27 2017, 12:41 PM

On ruwiki, many editors are complaining about slow updates of pages that use their templates. We have a huge job queue, and it keeps growing day by day, although no heavily used templates/modules have been changed in the last few days.

Please tell us: is there any advice that could be given to us, and to other local communities suffering from this?

Change 386636 merged by Elukey:
[operations/puppet@production] role::mediawiki::jobrunner: inc runners for refreshLinks/htmlCacheUpdate

https://gerrit.wikimedia.org/r/386636

Mentioned in SAL (#wikimedia-operations) [2017-10-30T08:42:42Z] <elukey> raised priority of refreshlink and htmlcacheupdate job execution on jobrunners (https://gerrit.wikimedia.org/r/#/c/386636/) - T173710

In T173710#3717940, @Jack_who_built_the_house wrote:

On ruwiki, many editors are complaining about slow updates of pages that use their templates. We have a huge job queue, and it keeps growing day by day, although no heavily used templates/modules have been changed in the last few days.

Please tell us: is there any advice that could be given to us, and to other local communities suffering from this?

Hi! We are trying to solve the issue from two sides: producing fewer jobs, and giving higher priority to consuming the current backlog (mostly htmlCacheUpdate and refreshLinks jobs). At the moment I don't think there is any good advice for local communities. We are hoping to reduce the backlog soon, but it might take a while :(

Thanks for the reply. It just surprises me that on enwiki the job queue is very small, while on ruwiki it's two thirds of the overall page count, even though enwiki is much more active. Is it because of the wide use of Wikidata on ruwiki?

In T173710#3718725, @Jack_who_built_the_house wrote:

Thanks for the reply. It just surprises me that on enwiki the job queue is very small, while on ruwiki it's two thirds of the overall page count, even though enwiki is much more active. Is it because of the wide use of Wikidata on ruwiki?

Yes, that's the reason.

We had some relief after the last change to the jobrunner configs, namely the queue started shrinking, but then we fell back into the bad behavior in which more jobs are constantly enqueued than completed:

Screen Shot 2017-10-30 at 6.19.11 PM.png (335×939 px, 36 KB)

I am currently seeing some big root jobs with timestamps around Oct 27th that keep having jobs executed, but I failed to track down what generated them. If anybody has any idea what procedure to follow to track down the root cause of this job queue increase, please come forward :)

All jobs have a requestId parameter, which is passed down through the execution chain. This is the same as the reqId field in logstash. Basically this means that if the originating request logged anything to logstash, you should be able to find it with the query `type:mediawiki reqId:xxxxx` and looking for the very first message. That assumes, of course, that the initial request logged anything.

In T173710#3720358, @EBernhardson wrote:

All jobs have a requestId parameter, which is passed down through the execution chain. This is the same as the reqId field in logstash. Basically this means that if the originating request logged anything to logstash, you should be able to find it with the query `type:mediawiki reqId:xxxxx` and looking for the very first message. That assumes, of course, that the initial request logged anything.

Thanks! I tried to spot check in logstash but I am able to see only the request that starts from the jobrunner (the one executing the job), not much more .. :(

https://gerrit.wikimedia.org/r/#/c/385248 is really really promising; not sure when it will be deployed, but it would surely help in quickly finding a massive template change or similar.

elukey moved this task from Backlog to In Progress on the User-Elukey board. Nov 2 2017, 12:49 PM

It was perhaps noted before, but because of the recursive nature of the refreshLinks and htmlCacheUpdate jobs, even if the backlog is being processed it may not look like it, because the jobs are just enqueuing new jobs. It will probably take some time to really know what effect things are having.
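The effect described above can be seen in a toy simulation. This is a simplified model, not MediaWiki's actual job partitioning: the job kinds, the `simulate` function and its parameters are all hypothetical. A "recursive" job covering many pages splits off a few "leaf" jobs and re-enqueues one job for the remainder, so executing a job can make the queue longer.

```python
from collections import deque


def simulate(root_pages, batch_size, leaves_per_split):
    """Run a FIFO queue to completion; return its length after each executed job.

    A job is a (kind, pages) tuple. Running a "recursive" job enqueues up to
    `leaves_per_split` leaf jobs of at most `batch_size` pages each, plus one
    recursive job for whatever pages remain. Running a "leaf" job enqueues nothing.
    """
    queue = deque([("recursive", root_pages)])
    lengths = []
    while queue:
        kind, pages = queue.popleft()
        if kind == "recursive":
            for _ in range(leaves_per_split):
                if pages <= 0:
                    break
                take = min(batch_size, pages)
                queue.append(("leaf", take))
                pages -= take
            if pages > 0:
                queue.append(("recursive", pages))
        lengths.append(len(queue))
    return lengths


if __name__ == "__main__":
    # One root job covering 10 pages, 2 pages per leaf, 2 leaves per split.
    print(simulate(root_pages=10, batch_size=2, leaves_per_split=2))
```

In this run eight jobs are executed, yet the queue length goes up and down instead of falling steadily. With many large root jobs in flight at once, a snapshot of the queue size therefore says little about how much of the original backlog has already been handled.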

In T173710#3730226, @EBernhardson wrote:

It was perhaps noted before, but because of the recursive nature of the refreshLinks and htmlCacheUpdate jobs, even if the backlog is being processed it may not look like it, because the jobs are just enqueuing new jobs. It will probably take some time to really know what effect things are having.

We are basically lagging 2-3 days in executing jobs, but the queue keeps growing and I have no idea whether we are facing something like T129517#2128754 or a 'genuine' (recursive) job enqueue explosion due to a template modification or similar.

https://gerrit.wikimedia.org/r/#/c/385248 should be already working for commons, but from mwlog1001's runJob.log I can only see stuff like causeAction=unknown causeAgent=unknown (that probably only confirms that no authenticated user/bot is triggering these jobs iteratively).

Status:

elukey@terbium:~$ mwscript extensions/WikimediaMaintenance/getJobQueueLengths.php |sort -n -k2 | tail -n 20
euwiki 237
tgwiki 3759
cawiki 4822
enwiktionary 17148
zhwiki 19958
nowiki 21167
wikidatawiki 28257
bewiki 110296
arwiki 132139
ukwiki 132246
dewiki 155322
svwiki 179250
frwiki 214327
hywiki 504377
itwiki 512539
cewiki 593156
enwiki 654998
ruwiki 5274159
commonswiki 8059943

Total 16619065
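As a quick reading aid for the numbers above, the sketch below parses `wiki count` lines in the format of this listing and computes the share of the total held by the largest queues. The function names are illustrative assumptions and only three of the listed wikis are included in the sample. On these figures, commonswiki and ruwiki together account for about 80% of all queued jobs.

```python
def parse_lengths(text):
    """Parse 'name count' lines; return ({wiki: count}, total or None)."""
    lengths = {}
    total = None
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2 or not parts[1].isdigit():
            continue  # skip blank or unexpected lines
        name, count = parts[0], int(parts[1])
        if name == "Total":
            total = count
        else:
            lengths[name] = count
    return lengths, total


def top_share(lengths, total, n):
    """Fraction of `total` held by the `n` largest queues."""
    largest = sorted(lengths.values(), reverse=True)[:n]
    return sum(largest) / total


# The three largest queues and the total, copied from the listing above.
OUTPUT = """\
enwiki 654998
ruwiki 5274159
commonswiki 8059943

Total 16619065
"""

if __name__ == "__main__":
    lengths, total = parse_lengths(OUTPUT)
    print(f"top 2 wikis hold {top_share(lengths, total, 2):.1%} of the queue")
```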

In T173710#3730359, @elukey wrote:

https://gerrit.wikimedia.org/r/#/c/385248 should be already working for commons, but from mwlog1001's runJob.log I can only see stuff like causeAction=unknown causeAgent=unknown (that probably only confirms that no authenticated user/bot is triggering these jobs iteratively).

The unknown causes may also stem from the fact that the patch was not active when the initial job was executed, and so its descendants can't know the cause.

Change 388416 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/mediawiki-config@master] Increase concurrency for htmlCacheUpdate

https://gerrit.wikimedia.org/r/388416

Change 388416 merged by Giuseppe Lavagetto:
[operations/mediawiki-config@master] Increase concurrency for htmlCacheUpdate

https://gerrit.wikimedia.org/r/388416

Mentioned in SAL (#wikimedia-operations) [2017-11-03T10:39:07Z] <oblivian@tin> Synchronized wmf-config/CommonSettings.php: Increase concurrency of htmlCacheUpdate jobs T173710 (duration: 00m 48s)

XXN unsubscribed. Nov 3 2017, 6:17 PM

Change 389427 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] jobrunner: make refreshlinks jobs low-priority

https://gerrit.wikimedia.org/r/389427

Change 389427 merged by Giuseppe Lavagetto:
[operations/puppet@production] jobrunner: make refreshlinks jobs low-priority

https://gerrit.wikimedia.org/r/389427

Mentioned in SAL (#wikimedia-operations) [2017-11-06T09:37:49Z] <_joe_> manually running htmlCacheUpdate for commonswiki and ruwiki on terbium, T173710

elukey moved this task from In Progress to Stalled on the User-Elukey board. Nov 7 2017, 10:32 AM

Krinkle added a commit: rMW8f829de5f040: Add action/user tracking to link refresh jobs. Nov 8 2017, 9:01 AM

Krinkle updated the task description. Nov 8 2017, 9:03 AM

Krinkle added a commit: rEWBA081c1cd78f9d: Add cause action/agent tracking to link refresh jobs. Nov 8 2017, 11:04 AM

elukey moved this task from Stalled to Keep an eye on it on the User-Elukey board. Nov 17 2017, 10:25 AM

Lydia_Pintscher removed a project: Wikidata-Former-Sprint-Board. Nov 21 2017, 8:53 AM

Is T178840: Commons category updates are very slow. Commons job queue very large. a duplicate?

@Aklapper Probably, but I would close that one, as that should not be happening right now, unless you have reports saying it is again.

Addshore subscribed. Dec 7 2017, 3:24 PM

Krinkle merged a task: T178840: Commons category updates are very slow. Commons job queue very large. Dec 15 2017, 1:40 AM

Krinkle added a subscriber: Jeff_G.

I'm working on a proper solution for refreshLinks jobs that are triggered from Wikidata. I've made lots of progress in the past couple of weeks; hopefully this will be resolved by next week.

Restricted Application added a project: User-Ladsgroup. Jan 30 2018, 11:39 AM

Krinkle removed a project: Patch-For-Review. Feb 13 2018, 1:15 AM

Krinkle removed projects: Discovery-ARCHIVED, CirrusSearch.

The jobqueue size has been reduced to 1.6M and will go down more once we enable "lua fine grained usage tracking" and "statement fine grained usage tracking" in more wikis.

Liuxinyu970226 unsubscribed. Feb 13 2018, 1:35 PM

In T173710#3966996, @Ladsgroup wrote:

The jobqueue size [..] will go down more once we enable "lua fine grained usage tracking" and "statement fine grained usage tracking" in more wikis.

Ref: T184322: Enable fine grained lua tracking gradually in client wikis

Krinkle closed subtask T178810: Wikibase: Increase batch size for HTMLCacheUpdateJobs triggered by repo changes. as Declined. Jul 11 2018, 3:04 AM

Addshore closed subtask T178806: Wikibase: Batch HTMLCacheUpdateJobs across changes as Resolved. Nov 26 2019, 9:28 AM

Maintenance_bot moved this task from Incoming to Done on the User-Ladsgroup board. Nov 26 2019, 12:21 PM

rEWBA extension-Wikibase
	rEWBA081c1cd78f9d Add cause action/agent tracking to link refresh jobs
rMW MediaWiki
	rMW8f829de5f040 Add action/user tracking to link refresh jobs
