This is the explanation from CrowdStrike about the root cause of last week's BSOD. Testing failed to detect the buggy file, and the buggy file was then deployed to all customer machines at the same time. The industry has long since developed techniques and practices to safeguard production deployments. Unfortunately, those practices have not been widely adopted.

This incident reminds me of the work we did at Microsoft Autopilot a decade ago. We learned that even when a code change passed testing, it could still bring down the system on real-world inputs. So we developed safe application deployment: app changes were applied to only a small percentage of machines in an online cluster, a watchdog closely monitored their health, and the percentage of machines was gradually increased only while the applications remained healthy. We did the same for data changes.

We also learned that even a one-bit change in a config could bring down a system. So we developed safe config deployment: a config change was likewise applied to only a small percentage of machines and closely monitored, and only when it caused no issues was it rolled out to more machines.

Even with such rigorous monitoring, some failures could still slip past the automated monitoring mechanisms. So staged deployment was developed: an application, data, or config change would first be applied in one region; after waiting for some time with no reported issues, it would move to the next region, and then the next.

I hope such safe rollout practices become widely adopted, especially for services with huge footprints. https://lnkd.in/ge7Vaqrs
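For readers unfamiliar with the pattern, here is a minimal sketch of what such a percentage-based rollout loop can look like. It is purely illustrative: the stage percentages, bake time, and the apply_change / is_healthy functions are assumptions for the example, not Autopilot's or any vendor's actual implementation.

import random
import time

# Hypothetical sketch of percentage-based safe deployment with a health
# watchdog. Stage fractions, bake time, and helper functions are assumed.
ROLLOUT_STAGES = [0.01, 0.05, 0.25, 1.00]   # fraction of machines per stage
BAKE_TIME_SECONDS = 1                        # shortened for the example

def apply_change(machine: str) -> None:
    """Stand-in for pushing the new app/config/data change to one machine."""
    print(f"applied change to {machine}")

def is_healthy(machine: str) -> bool:
    """Stand-in watchdog check; a real one would query metrics or heartbeats."""
    return random.random() > 0.001           # simulate rare failures

def safe_rollout(machines: list[str]) -> bool:
    """Apply the change to an increasing percentage of machines, halting on failure."""
    done = 0
    for fraction in ROLLOUT_STAGES:
        target = max(1, int(len(machines) * fraction))
        for machine in machines[done:target]:
            apply_change(machine)
        done = target
        time.sleep(BAKE_TIME_SECONDS)        # bake time before widening
        unhealthy = [m for m in machines[:done] if not is_healthy(m)]
        if unhealthy:
            print(f"halting rollout: unhealthy machines {unhealthy}")
            return False                      # stop (and roll back) instead of proceeding
        print(f"stage at {fraction:.0%} healthy; widening rollout")
    return True

if __name__ == "__main__":
    cluster = [f"machine-{i:03d}" for i in range(200)]
    safe_rollout(cluster)

Staged regional deployment is the same loop one level up: treat each region as a stage, and only advance to the next region after the bake period passes cleanly.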