CrowdStrike has identified a bug in its Content Validator software as the cause of a widespread crash affecting 8.5 million Windows machines. According to the company, "Due to a bug in the Content Validator, one of the two Template Instances passed validation despite containing problematic content data."
CrowdStrike performs both automated and manual testing on Sensor Content and Template Types, but the Rapid Response Content delivered on Friday was not tested as thoroughly. A March deployment of new Template Types had provided "trust in the checks performed in the Content Validator," so CrowdStrike assumed that the Rapid Response Content rollout would not cause issues.
This assumption proved costly, as the sensor loaded the problematic Rapid Response Content into its Content Interpreter, triggering an out-of-bounds memory exception. "This unexpected exception could not be gracefully handled, resulting in a Windows operating system crash (BSOD)," explains CrowdStrike.
In doing so, CrowdStrike violated a fundamental rule of software testing: never assume it will work. This "No Assumption" principle calls for thorough, consistent testing of every release, rather than relying on the success of earlier ones, to ensure reliability and prevent critical failures like this one.
To prevent future occurrences, CrowdStrike has committed to enhancing its testing protocols for Rapid Response Content. This includes local developer testing, content update and rollback testing, stress testing, fuzzing, and fault injection. Stability testing and content interface testing will also be performed on Rapid Response Content.
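Of the techniques on that list, fuzzing is the simplest to illustrate. The sketch below is not CrowdStrike's code: `interpret`, `ContentError`, and the shape of the content are hypothetical stand-ins for a component that reads fields out of a content update. The idea is to throw a large number of random inputs at the reader and confirm that bad input is always rejected cleanly rather than causing an unhandled error.

```python
import random


class ContentError(Exception):
    """Raised when content is rejected instead of crashing the reader."""


def interpret(fields: list[str], wanted_index: int) -> str:
    # Hypothetical interpreter step: read one field, guarded by a bounds check.
    if not 0 <= wanted_index < len(fields):
        raise ContentError(
            f"field {wanted_index} out of range for {len(fields)} fields"
        )
    return fields[wanted_index]


def fuzz(iterations: int, seed: int = 0) -> int:
    """Feed random content to interpret(); return how many inputs were rejected."""
    rng = random.Random(seed)  # fixed seed so a failing run can be reproduced
    rejected = 0
    for _ in range(iterations):
        fields = ["x"] * rng.randint(0, 25)
        index = rng.randint(-5, 30)
        try:
            interpret(fields, index)
        except ContentError:
            rejected += 1  # a graceful rejection is the desired outcome
        # Any other exception propagates and fails the fuzz run.
    return rejected
```

A real fuzzer would generate far richer inputs and run against the actual parser, but the pass criterion is the same: no input, however malformed, should produce anything worse than a controlled rejection.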
CrowdStrike is also updating its cloud-based Content Validator to better scrutinize Rapid Response Content releases. "A new check is in process to guard against this type of problematic content from being deployed in the future," says CrowdStrike.
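CrowdStrike has not described the new check in detail here, so the following is only a guess at the general shape such a check could take. It assumes, hypothetically, that a Template Instance refers to input fields by position and that the validator knows how many fields the sensor will actually supply; `TemplateInstance`, `referenced_fields`, and `validate` are invented names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TemplateInstance:
    # Hypothetical shape: the field positions this instance wants to read.
    name: str
    referenced_fields: tuple[int, ...]


def validate(instance: TemplateInstance, fields_supplied: int) -> list[str]:
    """Return a list of problems; an empty list means the instance passes."""
    problems = []
    for index in instance.referenced_fields:
        if index < 0 or index >= fields_supplied:
            problems.append(
                f"{instance.name}: field {index} is outside 0..{fields_supplied - 1}"
            )
    return problems
```

For example, `validate(TemplateInstance("b", (0, 5)), 5)` reports one problem, because position 5 does not exist when only five fields (0 through 4) are supplied. Content that fails a check like this would be blocked in the cloud, before it ever reaches a sensor.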
On the driver side, CrowdStrike will "enhance existing error handling in the Content Interpreter," which is part of the Falcon sensor. Additionally, CrowdStrike plans to implement a staggered deployment strategy for Rapid Response Content, gradually rolling out updates to larger portions of its install base rather than pushing updates to all systems at once. These driver improvements and staggered deployments have been recommended by security experts in recent days.
Staggered deployment, also known as phased rollout or incremental deployment, is a software release strategy where new features or updates are gradually introduced to users in stages rather than being deployed to all users simultaneously. This approach helps mitigate the risks associated with large-scale releases and allows for better management of potential issues. Key benefits of staggered deployment include:
1. Risk Mitigation: By releasing to a small subset of users initially, potential issues can be identified and addressed before they impact the entire user base.
2. Controlled Environment: Developers and IT teams can monitor the performance and stability of the new release in a controlled environment, making it easier to diagnose and fix problems.
3. User Feedback: Early adopters can provide valuable feedback that can be used to make improvements before the full rollout.
4. Resource Management: It allows for better management of server loads and other resources by avoiding a sudden spike in usage.
5. Rollback Capability: If a critical issue is discovered, it is easier to roll back changes affecting a smaller group of users.
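The risk mitigation and rollback points above can be sketched as a simple staged rollout loop. This is a minimal illustration under stated assumptions, not a description of CrowdStrike's system: `deploy`, `is_healthy`, and `rollback` are hypothetical callbacks supplied by the caller, and the 1% failure threshold is an arbitrary example value.

```python
from typing import Callable, Sequence


def staged_rollout(
    hosts: Sequence[str],
    stage_fractions: Sequence[float],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
    max_failure_rate: float = 0.01,
) -> bool:
    """Deploy in growing stages; roll back every updated host on a bad stage."""
    updated: list[str] = []
    for fraction in stage_fractions:
        # Cumulative number of hosts that should be updated after this stage.
        target = round(len(hosts) * fraction)
        batch = hosts[len(updated):target]
        for host in batch:
            deploy(host)
        updated.extend(batch)
        failures = sum(1 for host in batch if not is_healthy(host))
        if batch and failures / len(batch) > max_failure_rate:
            for host in updated:
                rollback(host)
            return False  # stop before the update reaches anyone else
    return True
```

Called with `stage_fractions=[0.01, 0.1, 1.0]`, a faulty update that breaks the first 1% of hosts is rolled back there and never reaches the remaining 99%.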
Common strategies for staggered deployment include:
1. Canary Releases: Deploying the update to a small, random subset of users to test its impact.
2. Geographic Rollouts: Releasing the update to users in specific regions or countries in stages.
3. User Segmentation: Rolling out updates to users based on specific characteristics, such as subscription plans or usage patterns.
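One common way to pick a canary group is to hash a stable identifier into a fixed number of buckets and update only the lowest buckets first. The sketch below assumes each machine has a string ID; `rollout_bucket` and `in_canary` are illustrative names, and the choice of SHA-256 and 100 buckets is arbitrary.

```python
import hashlib


def rollout_bucket(host_id: str, buckets: int = 100) -> int:
    """Map a host to a stable bucket in [0, buckets) using SHA-256."""
    digest = hashlib.sha256(host_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def in_canary(host_id: str, percent: int) -> bool:
    """True if the host falls inside the first `percent` of 100 buckets."""
    return rollout_bucket(host_id) < percent
```

Because the bucket depends only on the ID, the selection is effectively random across the fleet but repeatable: a host that is in the 5% group is also in the 20% group, so raising `percent` widens the rollout without reshuffling who has already been updated.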
Staggered deployment helps ensure a smoother, more reliable update process, ultimately leading to a better user experience.