Parsoid outage 2016-05-05
Closed, ResolvedPublic
Actions

Description

Summary

A combination of several subtle issues lead to a feedback loop in a test deploy of the ChangeProp service, which is designed to orchestrate the propagation of changes across services using EventBus. This created a spike of un-cacheable requests to RESTBase and Parsoid, which temporarily overloaded those clusters, and resulted in users experiencing high latencies and request errors for a duration of about one hour.

Timeline

2016-05-04T15:02:45 first requests to parsoid for oldId 106801025 (from ChangePropagation/WMF)

2016-05-05T02:16:10 first INFO [HintedHandoff:2] 2016-05-03 02:16:10,523 HintedHandOffManager.java:486 - Timed out replaying hints to /10.192.32.137; aborting (25847 delivered) (on restbase1014)

2016-05-05T02:28 org.apache.cassandra.db.compaction.CompactionTask log events elevated

2016-05-05T02:3X bytes_out and cpu_user start to spike on rb cluster

2016-05-05T02:3X load spikes on parsoid cluster

2016-05-05T02:34:16 first "Retry count exceeded" error in /srv/log/changeprop/main.log (for http://fr.wikipedia.org/api/rest_v1/page/html/%EA%9D%AE)

2016-05-05T02:38:14 first alert: <icinga-wm> PROBLEM - restbase endpoints health on restbase1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.

2016-05-05T03:20:00 (ca.) Ori stops changeprop service

2016-05-05T03:30:00 (ca.) Load on Parsoid and RESTBase drops, service recovers.

Analysis

The lower-case version of http://fr.wikipedia.org/api/rest_v1/page/html/%EA%9D%AE was actually an illegal title according to the wiki's title capitalization setting, but this was not detected as a side-effect of an mbstring / HHVM bug T91056: Some letters (initial characters of page titles) not being correctly capitalised. When interpreted according to unicode & normalization rules of the wiki, it was a redirect to itself.

The uppercase title was edited, and change-propagation picked up the edit and requests updates in restbase with a no-cache request for the page summary end point.
- Note: The same update is also triggered by normal job processing.

RESTBase checked whether the page is a redirect. While the uppercase version of the page was not a redirect, the lower-case version was. RESTBase always stores content and revision information at normalized titles, which for this wiki means an upper-cased first char. The lower-case version of the article happened to have a higher revision number, so the redirect flag was picked up. To find the redirect target, the access / redirect filter then requested HTML to parse redirect target, but forwarded all the headers, including no-cache [bug](https://github.com/wikimedia/restbase/blob/master/lib/revision_table_access_check_filter.js#L37).

In parsoid.js (in the RESTBase codebase), the missing revision parameter then triggered no-cache revision fetching, which retrieved the latest revision of the *uppercase* title from MW API, yielding 106801025.

This revision was then requested from parsoid. The content returned from parsoid was classified as differing & was saved.
- Reason for always-diff: The contentReq at https://github.com/wikimedia/restbase/blob/master/sys/parsoid.js#L420 was only based on the request parameters, which in this case *did not include the revision*. The latest stored revision was actually 125284517 (corresponding to the lowercase title), thus causing the content to mis-match when compared to the re-render of 106801025.
- Secondary updates emitted a resource change event for HTML

Changeprop picked up the resource change event from #5, going back to #1 of the loop.

This by itself would have just caused a sequential repeat of a single request, and would not have lead to an outage. However, several other issues caused an exponential retry amplification, which in turn created enough load on RESTBase and Parsoid to make this an outage:

RESTBase secondary updates (temporarily inline) also perform an update request to the Mobile Content Service (MCS)
- MCS in turn requests HTML from RESTBase without no-cache but with ?redirect=false, gets HTML with self-redirect
- Inline redirect extraction was still active in MCS, causing it to repeat requests 10x (all served from storage) until redirect limit reached, then returned 400 from parsoid-access.js (lots of log entries for this in RB logs).
- Since this was a no-cache request, secondary updates are done synchronously, and errors are reported. A generic 500 was returned from RESTBase.

The change-propagation service received the 500 response (triggered by MCS redirect limit) and, after a delay of ~500ms, retried (5x) until the retry count was exceeded.
ChangeProp processed retries eagerly & in parallel with resource_change events. There was no limit on processing concurrency apart form CPU, and delays for retries were short (500ms). As a result, parallelism of retries & event processing ramped up until the CPU was saturated, sending out at least 500+ un-cacheable requests per second.

Actionables

RESTBase

Don't propagate the cache-control header for access checks: https://github.com/wikimedia/restbase/pull/604 [done]

change contentReq at https://github.com/wikimedia/restbase/blob/master/sys/parsoid.js#L420 to always use a / the same revision as is used for the parsoid request: https://github.com/wikimedia/restbase/pull/611 [done]

Create a generic way to prevent event loops: https://github.com/wikimedia/restbase/pull/607 [done]

Look into stopping no-cache revision fetching as restriction update mechanism (avoids mis-match between revision & content storage).
- This will happen with the move to a separate restriction table, see https://github.com/wikimedia/restbase/pull/599.

Improve logging / error reporting from temporary dependency updates.
- These won't be around for much longer as they are moving to the changeprop service in the next weeks. Does not seem to be worth it. [done]

ChangeProp service

T134456: Implement concurrency limiting on change propagation & enforce conservative limits and delays

T134483: Change Propagation should not follow redirects

Mobile Content Service

T134538: Remove redirect follow code from MCS

Parsoid

investigate slow render of the pages involved here

consider setting up rate limits for repeat parses of the same domain / title / revision: T134516: Ban a set of titles

Alerts

Set up mail alerts to services@ for service_checker failures: T134551: Create functional cluster checks for all services (and have them page!)
- In node.pp or restbase.pp ?

Set up alert / monitoring against LVS end point, and make it page services *and* ops; should only trigger if service overall is down (not individual hosts): T134551: Create functional cluster checks for all services (and have them page!)

Related Objects
Search...

Status	Assigned	Task
Resolved	• GWicke	T134537 ChangeProp / RESTBase / Parsoid outage 2016-05-05
Resolved	akosiaris	T134551 Create functional cluster checks for all services (and have them page!)
Resolved	• Pchelolo	T134456 Implement concurrency limiting on change propagation & enforce conservative limits and delays

Event Timeline

• GWicke created this task.May 5 2016, 11:47 PM

Restricted Application added subscribers: Zppix, Aklapper. · View Herald TranscriptMay 5 2016, 11:47 PM

• GWicke updated the task description. (Show Details)May 5 2016, 11:49 PM

• GWicke mentioned this in T134538: Remove redirect follow code from MCS.May 5 2016, 11:52 PM

• GWicke updated the task description. (Show Details)

• GWicke claimed this task.May 5 2016, 11:54 PM

• GWicke updated the task description. (Show Details)

• GWicke added projects: SRE, Mobile-Content-Service.

• GWicke edited subscribers, added: • mobrovac, • Pchelolo, ssastry and 4 others; removed: Aklapper.

Worthwhile replacing all github urls with a version tied to a commit. This prevents the link pointers from pointing to the wrong lines as the code evolves.

ssastry updated the task description. (Show Details)May 6 2016, 12:04 AM

• GWicke updated the task description. (Show Details)May 6 2016, 12:28 AM

• GWicke updated the task description. (Show Details)May 6 2016, 12:31 AM

Probably should be copied to a new page in https://wikitech.wikimedia.org/wiki/Incident_documentation as well unless you can convince @greg that we should be switching to Phab for this sort of thing.

@bd808, it'll be easier for us to follow up on the actionables here. I created a pointer from the wiki, so it shows up as part of the incident list there.

• Mholloway subscribed.May 6 2016, 2:55 AM

Joe created subtask T134551: Create functional cluster checks for all services (and have them page!).May 6 2016, 5:05 AM

• GWicke updated the task description. (Show Details)May 6 2016, 4:00 PM

• GWicke updated the task description. (Show Details)May 6 2016, 4:51 PM

• GWicke updated the task description. (Show Details)May 6 2016, 7:12 PM

• GWicke updated the task description. (Show Details)

• bearND subscribed.May 9 2016, 4:58 AM

• bearND moved this task from Incoming to Tracking on the Mobile-Content-Service board.May 9 2016, 5:03 AM

The full incident report is available at wikitech:20160505-ChangeProp_RESTBase_Parsoid now as well.

• Pchelolo updated the task description. (Show Details)May 9 2016, 8:53 PM

Arlolra mentioned this in T134593: Intermittent corruptions on saves, likely related to data-parsoid mis-matches.May 9 2016, 9:36 PM

• GWicke updated the task description. (Show Details)May 11 2016, 1:09 AM

• Elitre subscribed.May 11 2016, 1:39 AM

• GWicke updated the task description. (Show Details)May 11 2016, 3:40 PM

• GWicke updated the task description. (Show Details)May 11 2016, 3:44 PM

See also T135171 for a related investigation into VE requests without if-match headers.

• mobrovac added a subtask: T134456: Implement concurrency limiting on change propagation & enforce conservative limits and delays.May 16 2016, 2:04 PM

• mobrovac closed subtask T134456: Implement concurrency limiting on change propagation & enforce conservative limits and delays as Resolved.May 18 2016, 6:21 AM

From what I can tell all but the Parsoid issues have been dealt with. Should we resolve this?

Most issues have indeed been addressed, and most of the remaining ones are also well underway. I agree that this task is no longer useful, and we should resolve it.

However, lets make sure that the remaining issues are indeed not forgotten by creating separate tasks for them first.

Resolving, as the remaining issues are all tracked / in progress elsewhere, with the exception of the Parsoid slow-render investigation. @ssastry, please create a separate task for that follow-up if you still consider it actionable.

akosiaris closed subtask T134551: Create functional cluster checks for all services (and have them page!) as Resolved.Mar 14 2023, 5:31 PM

Restricted Application added a project: Product-Infrastructure-Team-Backlog-Deprecated. · View Herald TranscriptMar 14 2023, 5:31 PM

ChangeProp / RESTBase / Parsoid outage 2016-05-05Closed, ResolvedPublicActions

Description

Summary

Timeline

Analysis

Actionables

RESTBase

ChangeProp service

Mobile Content Service

Parsoid

Alerts

Related ObjectsSearch...

Event Timeline

ChangeProp / RESTBase / Parsoid outage 2016-05-05
Closed, ResolvedPublic
Actions

Related Objects
Search...