Basics: We rolled out 1.21wmf5 to the non-Wikipedia sites today, after a brief reversion and redeployment to fix breakage in how some styling was displayed. We are on track to deploy 1.21wmf5 to English Wikipedia on Monday, December 3, per https://www.mediawiki.org/wiki/MediaWiki_1.21/Roadmap .
Below: why this happened, how it got fixed, and what we should change to prevent problems like this in the future.
What happened:
https://gerrit.wikimedia.org/r/#/c/30361/ changed the headings in the Vector skin. The new code didn't take the WMF config into account, as the author wasn't expecting styles and HTML to be cached in such different ways.
The headings were changed from "h4"/"h5" to different tags, but the CSS identified the headings by those tag names (instead of by CSS classes). As a result, the new CSS no longer matches the old markup in pages that are still cached, so the page layout can stay broken for up to 30 days.
Page cache is controlled by the wiki page content. Unless the page is modified, the cache is kept for up to 30 days for anonymous users. Resource modules, however, are served by ResourceLoader, which has its own much more efficient and deployable cache mechanism. This means that the resources for the skin are deployed globally and site-wide within 5 minutes, whereas the cached HTML can stay unchanged for weeks.
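To make the mismatch concrete, here is a minimal sketch in Python. It is only a toy model, not MediaWiki or ResourceLoader code: the function names are made up for illustration, the two durations are the figures quoted above, and "h3" is an assumed placeholder for whatever tag the headings were changed to.

```python
from datetime import timedelta

# Figures quoted in this note; hypothetical constants, not real WMF config.
CSS_ROLLOUT = timedelta(minutes=5)    # skin resources reach all clients
HTML_CACHE_MAX = timedelta(days=30)   # anonymous page cache lifetime


def unstyled_headings(page_heading_tags, css_selectors):
    """Return the heading tags in a page that no stylesheet selector matches."""
    return [tag for tag in page_heading_tags if tag not in css_selectors]


def mismatch_window(css_rollout=CSS_ROLLOUT, html_cache_max=HTML_CACHE_MAX):
    """Longest time a stale cached page can be paired with the new stylesheet."""
    return max(html_cache_max - css_rollout, timedelta(0))


# Cached HTML rendered before the change still uses the old tags.
old_page = ["h5", "h5", "h4"]
# HTML rendered after the change; "h3" is an assumed placeholder tag.
new_page = ["h3", "h3", "h3"]

# Stylesheet that targets only the new markup: cached pages lose their styling.
new_only_css = {"h3"}
# Backwards-compatible stylesheet: keeps the old selectors until caches expire.
compatible_css = {"h3", "h4", "h5"}

print(unstyled_headings(old_page, new_only_css))    # old headings are unstyled
print(unstyled_headings(old_page, compatible_css))  # nothing is unstyled
print(mismatch_window())
```

In this model, the backwards-compatible stylesheet styles both old and new pages, so the old selectors could be dropped only once the longest-lived cached HTML has expired.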
The issues that this caused were visible in beta labs for the last three days, but none of us realized they were significant; we thought they were caused by a misconfigured memcache. See https://bugzilla.wikimedia.org/show_bug.cgi?id=42452 .
We knew that this particular change and the related change https://gerrit.wikimedia.org/r/#/c/34702/ might be problematic and sent out a note about it on Monday -- http://lists.wikimedia.org/pipermail/wikitech-ambassadors/2012-November/0000... -- but it looks like we didn't test thoroughly enough on Monday and Tuesday to catch it before the Wednesday deploy. Only anonymous users would have been affected, because we don't cache pages for logged-in users in Squid. That is why logged-in users didn't notice problems on mediawiki.org and test2.wikipedia.org after the first deploy.
Problems popped up after the Phase 2 deployment to non-Wikipedia sites, so we reverted the 1.21wmf5 deployment and then redeployed while fixing, purging, etc.
Bug: https://bugzilla.wikimedia.org/show_bug.cgi?id=42452
Gerrit changes: https://gerrit.wikimedia.org/r/#/c/35819 , https://gerrit.wikimedia.org/r/#/c/35815/ , https://gerrit.wikimedia.org/r/#/c/35817/
What we should fix for the future:
This is why client resources must always be backwards compatible.
"Don't change the HTML in incompatible ways" is probably a good general rule to live by--but having an easy way to say "start purging all pages on $theseWikis from Squid/Varnish" would also be nice.
get more manual testing on test2.wikipedia.org and mediawiki.org immediately after Phase I deployment, including as anonymous reader and editor to ensure we catch Squid caching issues
train more people to review code well, to reduce backlog and catch these kinds of problems?
get more people to +2 in core and in important extensions
beta labs needs to be trustworthy enough to make this sort of thing a blocker immediately
Chris McMahon's take: (For what it's worth, this seems to me to be a sign that beta labs is becoming more and more trustworthy all the time. The more we actually use it, the more we'll understand what does and does not work there. We fixed the memcache problem, which fixed the ability to log in, but didn't investigate the display problems because we're used to beta not being very reliable. In this case, beta was reliable, and we didn't understand that. Even with a bug report in bugzilla with 9 subscribers, no one recognized a real issue.)
Chris McMahon said: I think this could be framed as an issue of signal, noise, and bandwidth. Beta labs being broken a lot, the review backlog in gerrit, and false failures in tests are all noise. Given the constraints of ongoing projects, it is difficult to pick out the signal from the noise. We can take steps to reduce the noise so that the signal stands out more by reducing technical debt: make the tests green, make the test environment robust, keep up with code review.
(I assembled this just now from IRC & mailing list chatter from several people, and errors are mine -- sorry for missing attributions here. Drafting was on http://etherpad.wmflabs.org/pad/p/nov-28-2012-deploysnafu )