Multiple projects reporting Cannot access the database: No working replica DB server
Closed, Resolved · Public · PRODUCTION ERROR

Authored By: Boshomi, May 24 2018, 7:32 PM
Referenced Files
F18514381: Screenshot_20180524-171600-01.jpeg
May 24 2018, 9:59 PM
F18514237: 2018-05-24_22-40-12_preview.jpeg
May 24 2018, 9:09 PM
F18512793: 33378763_2097967926898471_8812372984074338304_n.jpg
May 24 2018, 7:51 PM
F18512789: capture-20180524-214313.png
May 24 2018, 7:51 PM
Tokens
"The World Burns" token, awarded by Liuxinyu970226.
"The World Burns" token, awarded by Ivanhercaz.
"The World Burns" token, awarded by Addshore.
"100" token, awarded by Davey2010.
"Burninate" token, awarded by TheresNoTime.
"Heartbreak" token, awarded by Effeietsanders.

Description

Just now I got this:

(The database could not be accessed: Cannot access the database: No working replica DB server: Unknown error (10.64.32.198:3318))

Also this error:
(Cannot access the database: Cannot access the database: No working replica DB server: Unknown error (10.64.32.113))

Event Timeline

Change 435044 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/extensions/Wikibase@wmf/1.32.0-wmf.4] Do not register SpecialItemDisambiguation to stop DoS.

https://gerrit.wikimedia.org/r/435044

A hot fix has been applied to keep the sites up. Wikidata functions may be returning incorrect data in the meantime.

PropertySuggester is also disabled, ArticlePlaceholder won't show much, and Special:ItemDisambiguation is disabled as well.

I confirm the language for the aliens. Not user-friendly.

2018-05-24_22-40-12_preview.jpeg (168×1 px, 31 KB)

There is a separate task about the broken text: T195525: MWExceptionRenderer.php doesn't always declare the encoding used. It already has a patch pending, so in case another issue like this occurs, at least the error message will be displayed correctly ;)

Change 435055 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/extensions/Wikibase@wmf/1.32.0-wmf.5] Log the query that would hit wb_terms

https://gerrit.wikimedia.org/r/435055

Change 435057 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/extensions/Wikibase@wmf/1.32.0-wmf.4] Log the query that would hit wb_terms.

https://gerrit.wikimedia.org/r/435057

daniel added subscribers: Lydia_Pintscher, daniel.

Pinging @Lydia_Pintscher. This mostly affected Wikidata; the site was down for a while (an hour?). As far as I can tell, this is unrelated to the lexeme deployment. It may be related to the wb_terms work @Ladsgroup was doing, and/or something hitting Special:ItemDisambiguation rather hard. Analysis is ongoing.

Status summary:

There are some suggestions of further actions on https://etherpad.wikimedia.org/p/wb_terms_solution. Needs cleaning up and proper task breakdown.

Logging more info now, see https://gerrit.wikimedia.org/r/q/Id9fdc74829e6268ecc3861602adf6666c2eaffc4

Some users on ptwiki reported that they couldn't access Meta, roll back edits, or see diffs. And when I tried to access Wikidata, a server error warning appeared:

Screenshot_20180524-171600-01.jpeg (834×1 px, 192 KB)

Change 435057 merged by Ladsgroup:
[mediawiki/extensions/Wikibase@wmf/1.32.0-wmf.4] Log the query that would hit wb_terms.

https://gerrit.wikimedia.org/r/435057

Change 435055 merged by Ladsgroup:
[mediawiki/extensions/Wikibase@wmf/1.32.0-wmf.5] Log the query that would hit wb_terms.

https://gerrit.wikimedia.org/r/435055

Mentioned in SAL (#wikimedia-operations) [2018-05-24T22:11:42Z] <ladsgroup@tin> Synchronized php-1.32.0-wmf.4/extensions/Wikibase/lib/includes/Store/Sql/TermSqlIndex.php: Log the query that would hit wb_terms. (T195520) (duration: 01m 21s)

It appears to be resolved now on Arabic Wikipedia.

Mentioned in SAL (#wikimedia-operations) [2018-05-24T22:21:12Z] <ladsgroup@tin> Synchronized php-1.32.0-wmf.5/extensions/Wikibase/lib/includes/Store/Sql/TermSqlIndex.php: Log the query that would hit wb_terms. (T195520) (duration: 01m 20s)

Change 435079 had a related patch set uploaded (by Ladsgroup; owner: Amir Sarabadani):
[mediawiki/extensions/Wikibase@master] Use TypeDispatchingEntitySearchHelper for SearchHelper in several places

https://gerrit.wikimedia.org/r/435079

Vachovec1 subscribed.

Added Wikimedia-Incident tag. The incident report already exists...

Marostegui lowered the priority of this task from Unbreak Now! to High. (Edited May 25 2018, 9:23 AM)

Lowering priority, as the mitigation was put in place yesterday night (EU time).

Suggestions for the future:

  • Don't use temporary names for tables which may become permanent.
  • Name and describe tables appropriately, especially including those that have temporary names.
  • Use invisible indexes for a while before actually dropping.
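
To illustrate the last suggestion: MySQL 8.0 can mark an index INVISIBLE, and MariaDB 10.6+ can mark one IGNORED, so the optimizer stops using the index while it is still maintained and can be switched back quickly. Whether either feature is available on the servers involved here is not established in this task. The helper names below are hypothetical, and pairing wb_terms with tmp1 in the usage note is only an example.

```python
# Hypothetical helpers: build the SQL for hiding an index from the optimizer
# before dropping it for real. The statements are only built, never executed.

_HIDE = {"mysql": "INVISIBLE", "mariadb": "IGNORED"}
_SHOW = {"mysql": "VISIBLE", "mariadb": "NOT IGNORED"}


def hide_index_sql(table: str, index: str, dialect: str = "mysql") -> str:
    """Return a statement that keeps the index on disk but hides it from the optimizer."""
    return f"ALTER TABLE `{table}` ALTER INDEX `{index}` {_HIDE[dialect]}"


def restore_index_sql(table: str, index: str, dialect: str = "mysql") -> str:
    """Return the rollback statement that makes the index usable again."""
    return f"ALTER TABLE `{table}` ALTER INDEX `{index}` {_SHOW[dialect]}"
```

For example, hide_index_sql("wb_terms", "tmp1") builds the "hide" statement and restore_index_sql("wb_terms", "tmp1") builds its rollback; only after a quiet observation period would the actual DROP INDEX follow.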

The TermSqlIndex::getMatchingTerms method which was patched out during the incident has now been added back.

* 15:11 addshore: Wikibase - Re enable wb_terms things window done
* 14:57 addshore@tin: Synchronized php-1.32.0-wmf.5/extensions/Wikibase: [[gerrit:436007{{!}}TermSqlIndex::getMatchingTerms actually execute select]] (duration: 02m 19s)
* 14:49 addshore@tin: Synchronized php-1.32.0-wmf.4/extensions/Wikibase: [[gerrit:436006{{!}}TermSqlIndex::getMatchingTerms actually execute select]] (duration: 02m 18s)
* 14:32 addshore@tin: Synchronized php-1.32.0-wmf.4/extensions/Wikibase: [[gerrit:436004{{!}}Re add TermSqlIndex::getMatchingTerms select, but dont call]] (duration: 02m 18s)
* 14:29 addshore@tin: Synchronized php-1.32.0-wmf.5/extensions/Wikibase: [[gerrit:436003{{!}}Re add TermSqlIndex::getMatchingTerms select, but dont call]] (duration: 02m 13s)
* 14:13 addshore@tin: Synchronized php-1.32.0-wmf.4/extensions/Wikibase: [[gerrit:436001{{!}}track all wb_terms table access via statsd]] (duration: 02m 19s)
* 14:10 addshore@tin: Synchronized php-1.32.0-wmf.5/extensions/Wikibase: [[gerrit:436000{{!}}track all wb_terms table access via statsd]] (duration: 02m 21s)

ArticlePlaceholder search exposure, PropertySuggester and Special:ItemDisambiguation are still disabled individually.

There is also a new Grafana dashboard monitoring the calls to the wb_terms table now: https://grafana-admin.wikimedia.org/dashboard/db/wikibase-wb_terms
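
For readers unfamiliar with statsd-style tracking, here is a minimal sketch of what counting table accesses can look like on the wire. It is not the Wikibase patch itself (that is PHP, see the gerrit changes above); the metric name, host and port are assumptions for illustration.

```python
import socket


def statsd_counter_line(metric: str, value: int = 1) -> bytes:
    """Format a StatsD counter datagram: '<metric>:<value>|c'."""
    return f"{metric}:{value}|c".encode("ascii")


def send_counter(metric: str, host: str = "127.0.0.1", port: int = 8125, value: int = 1) -> None:
    """Fire-and-forget a counter increment over UDP (8125 is the usual StatsD port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(statsd_counter_line(metric, value), (host, port))
```

Calling send_counter("wikibase.wb_terms.select") next to each query site would give a dashboard one counter per access path; the metric name here is hypothetical.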

I created T195792: Add support for setting individual query timeout in wikimedia/rdbms as something that would be useful to have right now for this case to avoid it taking down the whole shard.
Another thing I would personally like to look into is paging Wikidata deployers for s8 issues / lag and/or www.wikidata.org downtime / exceptions (no ticket filed yet).
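
As a rough sketch of the per-query timeout idea from T195792: MariaDB accepts SET STATEMENT max_statement_time=<seconds> FOR <statement>, and MySQL accepts a MAX_EXECUTION_TIME(<milliseconds>) optimizer hint on SELECT. The helper below only builds the SQL string; its name and shape are hypothetical and are not the wikimedia/rdbms API.

```python
def with_timeout(select_sql: str, seconds: float, dialect: str = "mariadb") -> str:
    """Wrap a single SELECT so the server aborts it after `seconds` (hypothetical helper)."""
    if seconds <= 0:
        raise ValueError("timeout must be positive")
    if not select_sql.upper().startswith("SELECT "):
        raise ValueError("only plain SELECT statements are handled in this sketch")
    if dialect == "mariadb":
        # MariaDB: the limit applies to this one statement only, in seconds.
        return f"SET STATEMENT max_statement_time={seconds:g} FOR {select_sql}"
    if dialect == "mysql":
        # MySQL: optimizer hint placed right after SELECT, in milliseconds.
        millis = int(seconds * 1000)
        return f"SELECT /*+ MAX_EXECUTION_TIME({millis}) */ {select_sql[7:]}"
    raise ValueError(f"unknown dialect: {dialect}")
```

With something like this, a single expensive query against one table would be cut off by the server instead of piling up and saturating the replicas of the whole shard.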

Mentioned in SAL (#wikimedia-operations) [2018-05-29T15:52:02Z] <addshore@tin> Synchronized wmf-config/Wikibase.php: [[gerrit:435147|Revert - Dont load PropertySuggester]] T195520 (duration: 01m 19s)

Change 435079 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Use TypeDispatchingEntitySearchHelper for SearchHelper in several places

https://gerrit.wikimedia.org/r/435079

Regarding where the tmp1 index actually came from, see T47529#518889.

So, to wrap this ticket up: the incident report can be found at https://wikitech.wikimedia.org/wiki/Incident_documentation/20180524-wikidata

There is still a collection of mid-term actionables in progress, but the outage itself ended quite some time ago.

Change 435042 abandoned by Jforrester:
Do not register SpecialItemDisambiguation to stop DoS.

Reason:
Production wikis are now running wmf.8 or wmf.10.

https://gerrit.wikimedia.org/r/435042

Change 435044 abandoned by Jforrester:
Do not register SpecialItemDisambiguation to stop DoS.

Reason:
Production wikis are now running wmf.8 or wmf.10.

https://gerrit.wikimedia.org/r/435044

Vvjjkkii renamed this task from Multiple projects reporting Cannot access the database: No working replica DB server to 7bcaaaaaaa. (Jul 1 2018, 1:08 AM)
Vvjjkkii reopened this task as Open.
Vvjjkkii removed Addshore as the assignee of this task.
Vvjjkkii updated the task description. (Show Details)
Vvjjkkii removed subscribers: gerritbot, Aklapper.
1339861mzb renamed this task from 7bcaaaaaaa to Multiple projects reporting Cannot access the database: No working replica DB server. (Jul 1 2018, 6:17 AM)
1339861mzb updated the task description. (Show Details)
mmodell changed the subtype of this task from "Task" to "Production Error". (Aug 28 2019, 11:09 PM)