Lijiang Fang’s Post

Engineering Leader | ex-Microsoft VP | ex-Meta

This is CrowdStrike's explanation of the root cause of last week's BSOD. Testing failed to detect the buggy file, and the buggy file was then deployed to all customer machines at the same time. The industry has long since developed techniques and practices to safeguard production deployments. Unfortunately, those practices have not been widely adopted.

This incident reminds me of the work we did at Microsoft Autopilot a decade ago. We learned that even when a code change passed testing, it could still bring down the system with real-world inputs. So we developed safe application deployment. App changes were first applied to only a small percentage of machines in an online cluster. A watchdog closely monitored their health, and the percentage of machines was gradually increased as long as the applications stayed healthy. Data changes were handled the same way.

We also learned that even a simple one-bit change in a config could bring down a system. So we developed safe config deployment. Similarly, a config change was first applied to only a small percentage of machines and closely monitored. Only when the change caused no issues was it rolled out to more machines.

Even with such rigorous monitoring, the automated mechanisms could still miss some failures. So staged deployment was developed. An application, data, or config change would first be applied to one region. After waiting for some time, if no issues were reported, it would move to another region, and then another.

I hope such safe rollout practices will be widely adopted, especially for services that have huge footprints. https://lnkd.in/ge7Vaqrs
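
A minimal Python sketch of the rollout pattern described above: apply a change to a growing fraction of machines, let a watchdog check health after each step, and move region by region only while everything stays healthy. This is an illustration under stated assumptions, not Autopilot's actual implementation. The function names, the stage fractions, the rollback-on-failure behavior, and the `apply_change` / `is_healthy` / `rollback` callbacks are all hypothetical.

```python
import time
from typing import Callable, Dict, List, Sequence


def staged_rollout(
    machines: Sequence[str],
    apply_change: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
    stages: Sequence[float] = (0.01, 0.10, 0.50, 1.0),  # assumed fractions
) -> bool:
    """Apply a change to a growing fraction of machines.

    Returns True if every stage passed the health check. On the first
    failed check, rolls back every updated machine and returns False.
    """
    updated: List[str] = []
    for fraction in stages:
        # Always touch at least one machine, even in a tiny cluster.
        target = max(1, int(len(machines) * fraction))
        for machine in machines[len(updated):target]:
            apply_change(machine)
            updated.append(machine)
        # Watchdog step: all machines updated so far must be healthy.
        if not all(is_healthy(m) for m in updated):
            for m in updated:
                rollback(m)
            return False
    return True


def regional_rollout(
    regions: Dict[str, Sequence[str]],
    apply_change: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
    bake_seconds: float = 0.0,  # assumed wait between regions
) -> List[str]:
    """Roll out one region at a time; stop at the first failing region.

    Returns the names of the regions that were fully updated.
    """
    completed: List[str] = []
    for name, machines in regions.items():
        if not staged_rollout(machines, apply_change, is_healthy, rollback):
            break  # later regions are never touched
        completed.append(name)
        # Wait so that issues missed by the watchdog can be reported.
        time.sleep(bake_seconds)
    return completed
```

In this sketch a failure in the first 1% of machines stops the rollout before the other 99% are ever touched, and a failure in one region keeps the change out of all later regions. The same two functions apply equally to application, data, and config changes, since the change itself is just the `apply_change` callback.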

Falcon Content Update Preliminary Post Incident Report | CrowdStrike

crowdstrike.com

Qichao L.

Product Owner at Broadcom Inc. | AI & Machine Learning | USC Marshall MBA

2mo

Love this
