Incidents/2023-02-22 read only
document status: final
Summary
Incident ID | 2023-02-22 read only | Start | 2023-02-22 11:03:25 (major impact starts at 2023-02-22 12:16:21) |
---|---|---|---|
Task | T330300 | End | 2023-02-22 12:18:48 |
People paged | 0 | Responder count | ~7 |
Coordinators | Jcrespo | Affected metrics/SLOs | ? |
Impact | For approximately 2 minutes, editing was disabled site-wide. For approximately 54 minutes, editing failed for some users in the codfw datacenter (around 1-2% of all edits) |
During a live switchover test in advance of the 2023 WMF datacenter switchover, an existing logic bug in the switchover test script accidentally set the secondary datacenter to read-only mode. While this didn't disrupt most users, mobile users geolocated to codfw app servers (mostly people in the Americas, and parts of Asia and Oceania) had the editing interface disabled, while desktop users were redirected to edit through eqiad. While trying to fix this issue, a tooling interface issue caused all datacenters to be set to read-only mode, disabling editing for all users. This was quickly reverted for both datacenters and editing was restored.
Timeline
All times in UTC.
- 11:03 <+logmsgbot> !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.02-set-readonly
- 11:03 <+logmsgbot> !log cgoubert@cumin1001 [DRY-RUN] MediaWiki read-only period starts at: 2023-02-22 11:03:19.149671
MediaWiki is now read-only in codfw only - Minor editing outage starts now
- 11:13 <+logmsgbot> !log cgoubert@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.07-set-readwrite (exit_code=0)
Only sets read-write in eqiad - Codfw is still read-only with the switchover message
User reports warn of ongoing issues (most edits from eqiad app servers and desktop-codfw can flow normally):
- 11:39 <Yahya> Hello, bnwiki is now read-only. Some users can edit and some can't. Can anyone tell me if any maintenance work is going on! Never seen a wiki is read-only for so long.
- 11:42 <taavi> I ma about to leave but -tech has a report of users seeing read-only errors
- 11:42 <jynus> taavi: which wiki? en?
- 11:42 <taavi> bn
- 11:42 <Bsadowski1> yeah bn
- 11:43 <jynus> that's s3
- 11:43 <claime> that's not normal, we should not be changing the RO status in the live DC during the live-test
- 11:49 <taavi> the timing matches with the read-only cookbook
Debugging ensues, including investigation of potentially unrelated causes.
- 12:09 <claime> cgoubert@cumin1001:/var/log/spicerack/sre/switchdc$ sudo confctl --object-type mwconfig select name=ReadOnly get
- 12:09 <claime> {"ReadOnly": {"val": "false"}, "tags": "scope=codfw"}
- 12:09 <claime> {"ReadOnly": {"val": false}, "tags": "scope=eqiad"}
- 12:13 <@taavi> why is the other false a string and the other a boolean?
The cause of the read-only state is confctl not setting the right type: it stored a string instead of a boolean.
Multiple combinations of confctl set tried:
- 12:15 <claime> sudo confctl --object-type mwconfig select name=ReadOnly,scope=codfw set/val=false
- 12:15 <claime> sudo confctl --object-type mwconfig select name=ReadOnly,scope=codfw set/val=False
- 12:16 <claime> sudo confctl --object-type mwconfig select name=ReadOnly,scope=codfw set/val=no
- 12:16 <+logmsgbot> !log akosiaris@cumin1001 conftool action : set/val=false; selector: name=ReadOnly
This last one, whose selector has no scope, sets eqiad read-only by the same mechanism: the variable is now a string instead of a boolean, which MediaWiki interprets as true.
Eqiad is now read-only too - Major editing outage starts now
- 12:18 Incident opened. Jaime becomes IC.
- 12:18 <+logmsgbot> !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.07-set-readwrite
- 12:18 <+logmsgbot> !log cgoubert@cumin1001 MediaWiki read-only period ends at: 2023-02-22 12:18:11.451680
- 12:18 <+logmsgbot> !log cgoubert@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.07-set-readwrite (exit_code=0)
- 12:18 <+logmsgbot> !log cgoubert@cumin1001 START - Cookbook sre.switchdc.mediawiki.07-set-readwrite
- 12:18 <+logmsgbot> !log cgoubert@cumin1001 MediaWiki read-only period ends at: 2023-02-22 12:18:45.829060
- 12:18 <+logmsgbot> !log cgoubert@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.07-set-readwrite (exit_code=0)
Outage stops now
The sre.switchdc.mediawiki.07-set-readwrite cookbook was used to set the right value type, running it once with codfw -> eqiad and once with eqiad -> codfw to set both datacenters.
- Both codfw and eqiad are now back to readwrite status
- 12:22 - 12:26: Double-checking with users that the issue is gone
- 12:39 Issue declared as resolved
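The type mismatch seen at 12:09 can be reproduced with a short Python sketch. It parses the two confctl output lines quoted in the timeline and prints the type and truthiness of each value. Python's `bool()` is used here only as a stand-in for a loose truthiness check; MediaWiki is written in PHP, where a non-empty string such as "false" is likewise truthy. The function name `read_only_by_scope` is hypothetical and not part of conftool.

```python
import json

# The two lines reported at 12:09 in the timeline, copied verbatim.
CONFCTL_OUTPUT = """\
{"ReadOnly": {"val": "false"}, "tags": "scope=codfw"}
{"ReadOnly": {"val": false}, "tags": "scope=eqiad"}
"""


def read_only_by_scope(output: str) -> dict:
    """Map each scope (hypothetical helper) to the raw JSON value of ReadOnly."""
    result = {}
    for line in output.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        # Tags look like "scope=codfw"; keep only the part after "=".
        scope = record["tags"].split("=", 1)[1]
        result[scope] = record["ReadOnly"]["val"]
    return result


values = read_only_by_scope(CONFCTL_OUTPUT)
for scope, val in sorted(values.items()):
    # Any non-empty string is truthy, including the string "false".
    print(f"{scope}: value={val!r} type={type(val).__name__} truthy={bool(val)}")
```

For codfw the value is the string "false", which evaluates as truthy, while for eqiad it is a real boolean and evaluates as falsy. This matches the observed state at 12:09: codfw read-only, eqiad writable.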
Detection
Editing issue from mobile + codfw:
- No alerting went off because of this
- Reports from #wikimedia-tech surfaced ongoing issues when editing from the mobile interface (read only disabled the edit button, while on desktop edits were sent to codfw)
Full read only mode issue:
- [12:20:17] <jinxer-wm> (MediaWikiHighErrorRate) firing: (4) Elevated rate of MediaWiki errors - appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
However, by this time the issue had already been corrected.
Specifically, the failure to set codfw back to read-write went undetected until some time had passed and user reports confirmed that the issue persisted.
Conclusions
What went well?
- Running the test gave people on call an early heads-up in case something went wrong or monitoring fired
- Several volunteers quickly and effectively raised issues on #wikimedia-tech and collaborated to help resolve the issue, especially while the error rate was low
- While they were not necessary in this scenario, there are multiple layers preventing a split-brain between datacenters (writes happening on two datacenters at the same time, independently)
What went poorly?
- Monitoring didn't catch the initial low rate of errors, as it was between 10 and 20 per minute and only 1-2% of total edits (and edits that were never attempted because editing was disabled in the UI cannot be monitored)
- Read-only behaved differently on desktop and mobile, which confused debugging
- Manual reverting was confusing or error-prone due to data type issues
Where did we get lucky?
Links to relevant documentation
- MediaWiki and EtcdConfig
- Conftool
- Switch Datacenter
- https://config-master.wikimedia.org/mediawiki.yaml
Actionables
- bug T330300: sre.switchdc.mediawiki.07-set-readwrite doesn't reset both datacenters to rw (Done)
- Stricter conftool data type validation?
- Uniformize mobile and desktop behaviour when in read only?
- bug T330304: Globalize mwconfig ReadOnly (would avoid unpredictable behaviour when one DC is RO and not the other)
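One possible shape for the stricter data type validation raised above, sketched in Python. It assumes ReadOnly should hold either a JSON boolean or a free-text read-only message (the timeline mentions codfw being read-only "with the switchover message"), so strings that merely look like booleans are rejected before being written. The function name, the set of lookalike strings, and the sample message are all hypothetical; this is not conftool's actual API or behaviour.

```python
# Hypothetical set of strings that look like booleans and are almost
# certainly typing mistakes when stored as a string value.
BOOLEAN_LOOKALIKES = {"true", "false", "yes", "no", "0", "1"}


def validate_read_only(val):
    """Return val if it is acceptable for ReadOnly, else raise ValueError.

    Assumption: a real boolean or a non-empty message string is valid;
    a string such as "false" or "no" is rejected as a likely type error.
    """
    if isinstance(val, bool):
        return val
    if isinstance(val, str):
        stripped = val.strip()
        if stripped.lower() in BOOLEAN_LOOKALIKES:
            raise ValueError(
                f"refusing string {val!r}: use a JSON boolean instead"
            )
        if stripped:
            return val  # treated as a read-only message
    raise ValueError(f"unsupported ReadOnly value: {val!r}")


# The string values attempted at 12:15-12:16, plus a real boolean and a
# made-up message for comparison.
for candidate in ("false", "False", "no", False, "Maintenance in progress"):
    try:
        print(repr(candidate), "-> accepted:", repr(validate_read_only(candidate)))
    except ValueError as err:
        print(repr(candidate), "-> rejected:", err)
```

Under these assumptions, the three string values tried during the incident would have been refused instead of being stored as truthy strings.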
Scorecard
Category | Question | Answer (yes/no) | Notes |
---|---|---|---|
People | Were the people responding to this incident sufficiently different than the previous five incidents? | yes | |
People | Were the people who responded prepared enough to respond effectively? | no | |
People | Were fewer than five people paged? | yes | No paging happened |
People | Were pages routed to the correct sub-team(s)? | no | |
People | Were pages routed to online (business hours) engineers? Answer “no” if engineers were paged after business hours. | yes | No one was paged |
Process | Was the "Incident status" section atop the Google Doc kept up-to-date during the incident? | yes | https://docs.google.com/document/d/1SwXRLONP4fG6YKfCg5B26IozpQ6Hst424_ihOH0anEA/edit |
Process | Was a public wikimediastatus.net entry created? | yes | https://www.wikimediastatus.net/incidents/yhshxyn9pw22 |
Process | Is there a phabricator task for the incident? | yes | |
Process | Are the documented action items assigned? | yes | |
Process | Is this incident sufficiently different from earlier incidents so as not to be a repeat occurrence? | yes | |
Tooling | To the best of your knowledge was the open task queue free of any tasks that would have prevented this incident? Answer “no” if there are open tasks that would prevent this incident or make mitigation easier if implemented. | no | The task that caused the issue was the one created to prevent the issue (circular dependency) |
Tooling | Were the people responding able to communicate effectively during the incident with the existing tooling? | yes | |
Tooling | Did existing monitoring notify the initial responders? | no | |
Tooling | Were the engineering tools that were to be used during the incident, available and in service? | no | Reverting the change caused confusion |
Tooling | Were the steps taken to mitigate guided by an existing runbook? | yes | |
Total score (count of all “yes” answers above) | 10 |