Page MenuHomePhabricator

ChangeProp / RESTBase / Parsoid outage 2016-05-05
Closed, ResolvedPublic

Description

Summary

A combination of several subtle issues lead to a feedback loop in a test deploy of the ChangeProp service, which is designed to orchestrate the propagation of changes across services using EventBus. This created a spike of un-cacheable requests to RESTBase and Parsoid, which temporarily overloaded those clusters, and resulted in users experiencing high latencies and request errors for a duration of about one hour.

Timeline

2016-05-04T15:02:45 first requests to parsoid for oldId 106801025 (from ChangePropagation/WMF)

<date change>

2016-05-05T02:16:10 first INFO  [HintedHandoff:2] 2016-05-03 02:16:10,523 HintedHandOffManager.java:486 - Timed out replaying hints to /10.192.32.137; aborting (25847 delivered) (on restbase1014)

2016-05-05T02:28 org.apache.cassandra.db.compaction.CompactionTask log events elevated

2016-05-05T02:3X bytes_out and cpu_user start to spike on rb cluster

2016-05-05T02:3X load spikes on parsoid cluster

2016-05-05T02:34:16 first "Retry count exceeded" error in /srv/log/changeprop/main.log (for http://fr.wikipedia.org/api/rest_v1/page/html/%EA%9D%AE)

2016-05-05T02:38:14 first alert:  <icinga-wm>         PROBLEM - restbase endpoints health on restbase1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.

2016-05-05T03:20:00 (ca.) Ori stops changeprop service

2016-05-05T03:30:00 (ca.) Load on Parsoid and RESTBase drops, service recovers.

Analysis

  1. The lower-case version of http://fr.wikipedia.org/api/rest_v1/page/html/%EA%9D%AE was actually an illegal title according to the wiki's title capitalization setting, but this was not detected as a side-effect of an mbstring / HHVM bug T91056: Some letters (initial characters of page titles) not being correctly capitalised. When interpreted according to unicode & normalization rules of the wiki, it was a redirect to itself.
  1. The uppercase title was edited, and change-propagation picked up the edit and requests updates in restbase with a no-cache request for the page summary end point.
    • Note: The same update is also triggered by normal job processing.
  1. RESTBase checked whether the page is a redirect. While the uppercase version of the page was not a redirect, the lower-case version was. RESTBase always stores content and revision information at normalized titles, which for this wiki means an upper-cased first char. The lower-case version of the article happened to have a higher revision number, so the redirect flag was picked up. To find the redirect target, the access / redirect filter then requested HTML to parse redirect target, but forwarded all the headers, including no-cache [bug](https://github.com/wikimedia/restbase/blob/master/lib/revision_table_access_check_filter.js#L37).
  1. In parsoid.js (in the RESTBase codebase), the missing revision parameter then triggered no-cache revision fetching, which retrieved the latest revision of the *uppercase* title from MW API, yielding 106801025.
  1. This revision was then requested from parsoid. The content returned from parsoid was classified as differing & was saved.
  1. Changeprop picked up the resource change event from #5, going back to #1 of the loop.

This by itself would have just caused a sequential repeat of a single request, and would not have lead to an outage. However, several other issues caused an exponential retry amplification, which in turn created enough load on RESTBase and Parsoid to make this an outage:

  • RESTBase secondary updates (temporarily inline) also perform an update request to the Mobile Content Service (MCS)
    • MCS in turn requests HTML from RESTBase without no-cache but with ?redirect=false, gets HTML with self-redirect
    • Inline redirect extraction was still active in MCS, causing it to repeat requests 10x (all served from storage) until redirect limit reached, then returned 400 from parsoid-access.js (lots of log entries for this in RB logs).
    • Since this was a no-cache request, secondary updates are done synchronously, and errors are reported. A generic 500 was returned from RESTBase.
  • The change-propagation service received the 500 response (triggered by MCS redirect limit) and, after a delay of ~500ms, retried (5x) until the retry count was exceeded.
  • ChangeProp processed retries eagerly & in parallel with resource_change events. There was no limit on processing concurrency apart form CPU, and delays for retries were short (500ms). As a result, parallelism of retries & event processing ramped up until the CPU was saturated, sending out at least 500+ un-cacheable requests per second.

Actionables

RESTBase 

  • Improve logging / error reporting from temporary dependency updates.
    • These won't be around for much longer as they are moving to the changeprop service in the next weeks. Does not seem to be worth it. [done]

ChangeProp service

Mobile Content Service

Parsoid

  • investigate slow render of the pages involved here

Alerts

Event Timeline

GWicke updated the task description. (Show Details)
GWicke edited subscribers, added: mobrovac, Pchelolo, ssastry and 4 others; removed: Aklapper.

Worthwhile replacing all github urls with a version tied to a commit. This prevents the link pointers from pointing to the wrong lines as the code evolves.

Probably should be copied to a new page in https://wikitech.wikimedia.org/wiki/Incident_documentation as well unless you can convince @greg that we should be switching to Phab for this sort of thing.

@bd808, it'll be easier for us to follow up on the actionables here. I created a pointer from the wiki, so it shows up as part of the incident list there.

See also T135171 for a related investigation into VE requests without if-match headers.

From what I can tell all but the Parsoid issues have been dealt with. Should we resolve this?

Most issues have indeed been addressed, and most of the remaining ones are also well underway. I agree that this task is no longer useful, and we should resolve it.

However, lets make sure that the remaining issues are indeed not forgotten by creating separate tasks for them first.

Resolving, as the remaining issues are all tracked / in progress elsewhere, with the exception of the Parsoid slow-render investigation. @ssastry, please create a separate task for that follow-up if you still consider it actionable.