This is the explanation from CrowdStrike about the root cause of last week's BSOD. Testing failed to detect the buggy file, and the buggy file was then deployed to all customer machines at the same time. The industry has long since developed techniques and practices to safeguard production deployments. Unfortunately, those practices have not been widely adopted.

This incident reminds me of the work we did at Microsoft Autopilot a decade ago. We learned that even when a code change passed testing, it could still bring down the system on real-world inputs. So we developed safe application deployment: app changes were applied to only a small percentage of machines in an online cluster, a watchdog closely monitored their health, and the percentage of machines was gradually increased only while the applications remained healthy. We did the same for data changes.

We also learned that even a one-bit change in a config could bring down a system. So we developed safe config deployment: a config change was likewise applied to only a small percentage of machines and closely monitored, and only when it caused no issues was it rolled out to more machines.

Even with such rigorous monitoring, some failures could still slip past the automated monitoring mechanisms. So staged deployment was developed: an application, data, or config change would first be applied in one region; after waiting for some time with no reported issues, it would move to the next region, and then the next.

I hope such safe rollout practices become widely adopted, especially for services with huge footprints. https://lnkd.in/ge7Vaqrs
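For readers unfamiliar with the pattern, here is a minimal sketch of what such a percentage-based rollout loop can look like. It is purely illustrative: the stage percentages, bake time, and the apply_change / is_healthy functions are assumptions for the example, not Autopilot's or any vendor's actual implementation.

import random
import time

# Hypothetical sketch of percentage-based safe deployment with a health
# watchdog. Stage fractions, bake time, and helper functions are assumed.
ROLLOUT_STAGES = [0.01, 0.05, 0.25, 1.00]   # fraction of machines per stage
BAKE_TIME_SECONDS = 1                        # shortened for the example

def apply_change(machine: str) -> None:
    """Stand-in for pushing the new app/config/data change to one machine."""
    print(f"applied change to {machine}")

def is_healthy(machine: str) -> bool:
    """Stand-in watchdog check; a real one would query metrics or heartbeats."""
    return random.random() > 0.001           # simulate rare failures

def safe_rollout(machines: list[str]) -> bool:
    """Apply the change to an increasing percentage of machines, halting on failure."""
    done = 0
    for fraction in ROLLOUT_STAGES:
        target = max(1, int(len(machines) * fraction))
        for machine in machines[done:target]:
            apply_change(machine)
        done = target
        time.sleep(BAKE_TIME_SECONDS)        # bake time before widening
        unhealthy = [m for m in machines[:done] if not is_healthy(m)]
        if unhealthy:
            print(f"halting rollout: unhealthy machines {unhealthy}")
            return False                      # stop (and roll back) instead of proceeding
        print(f"stage at {fraction:.0%} healthy; widening rollout")
    return True

if __name__ == "__main__":
    cluster = [f"machine-{i:03d}" for i in range(200)]
    safe_rollout(cluster)

Staged regional deployment is the same loop one level up: treat each region as a stage, and only advance to the next region after the bake period passes cleanly.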