As more details emerge about the recent CrowdStrike incident, in which a faulty software update took down millions of Windows-based machines, it has become evident that there is significant room for improvement in software development and release processes, and those improvements apply to your own processes as well. In this post, we will briefly discuss what went wrong at CrowdStrike, then walk through 12 lessons learned, each accompanied by a tip to improve your QA and release cycles and avoid similar catastrophes.
The recap
In the early hours (UTC) of July 19, 2024, CrowdStrike pushed a faulty content update for its flagship security product, Falcon, causing millions of Windows 10 and 11 machines and servers running the Falcon sensor to crash with a blue screen (a.k.a. BSOD) on every reboot. The initial fix involved booting each affected machine into safe mode, manually locating and removing the offending file, and rebooting. A few days later, a cloud-based remediation option was made available to customers.
Lessons learned
Drawing on industry reviews and CrowdStrike's own public announcements, here are 12 lessons learned to improve your software development and release processes through quality assurance:
- Unit testing in fresh environments: Unit testing is essential, but tests should not run only on the developer's computer (avoiding the "works on my machine" problem). Whenever possible, execute tests in newly created environments. Ephemeral test environments, customized virtual machines (VMs), or Infrastructure as Code (IaC) and configuration-management tools like Terraform and Ansible can help achieve this.
- Tip: Use cloud services like AWS, Azure, or Google Cloud to quickly spin up fresh environments for testing.
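A lightweight version of this idea, using only the Python standard library, is a test fixture that gives every run a brand-new working directory; the fixture and file names here are illustrative, and the same principle scales up to ephemeral VMs or containers:

```python
import os
import shutil
import tempfile
from contextlib import contextmanager

@contextmanager
def fresh_environment():
    """Give each test run a throwaway working directory, so no state
    from a developer machine (or a previous run) can leak in."""
    workdir = tempfile.mkdtemp(prefix="test-env-")
    previous = os.getcwd()
    os.chdir(workdir)
    try:
        yield workdir
    finally:
        os.chdir(previous)
        shutil.rmtree(workdir)  # nothing survives between runs

def test_writes_config():
    with fresh_environment() as env:
        # The test always starts with an empty directory.
        with open("config.ini", "w") as f:
            f.write("[sensor]\nenabled = true\n")
        assert os.path.exists(os.path.join(env, "config.ini"))

test_writes_config()
```

The point is not the temp directory itself but the guarantee behind it: every run starts clean and leaves nothing behind.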
- Ensure code coverage: Extend your unit testing to cover the actual changes in the codebase. Avoid running the same old tests repeatedly without assessing new code changes. Use code coverage tools to identify untested parts of your codebase.
- Tip: Tools like JaCoCo for Java, Istanbul for JavaScript, or Coverlet for .NET can help measure code coverage.
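To make the idea concrete, here is a toy line-coverage tracer built on Python's `sys.settrace`; real tools like JaCoCo or Coverlet do this far more robustly, and `classify` is just a stand-in function:

```python
import sys

def measure_line_coverage(func, *args):
    """Record which line numbers of `func` actually execute --
    a miniature version of what coverage tools do under the hood."""
    executed = set()
    code = func.__code__

    def tracer(frame, event, arg):
        if event == "line" and frame.f_code is code:
            executed.add(frame.f_lineno)
        return tracer

    sys.settrace(tracer)
    try:
        func(*args)
    finally:
        sys.settrace(None)  # always restore, even if func raises
    return executed

def classify(value):
    if value > 0:
        return "positive"
    return "non-positive"  # never reached by the call below

covered = measure_line_coverage(classify, 5)  # only the positive branch runs
```

Running only the happy path leaves the second `return` out of `covered`, which is exactly the kind of gap a coverage report flags when new code lands without matching tests.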
- Verify your artifacts: Configure your releases to sign or at least hash all artifacts accompanying your code. Implement a signature/hash verification process that precedes execution in both lower and production environments. This is crucial for systems that receive updates or deal with external files.
- Tip: Use GPG or similar tools to sign artifacts and ensure their integrity before deployment.
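As a minimal illustration of the hash half of this advice (GPG signing adds authenticity on top of the integrity check shown here; the artifact bytes are made up):

```python
import hashlib
import hmac

def sha256_digest(data: bytes) -> str:
    """Digest computed at build time and published alongside the artifact."""
    return hashlib.sha256(data).hexdigest()

def verify_artifact(data: bytes, expected: str) -> bool:
    """Gate that runs before the artifact is ever loaded or executed."""
    # hmac.compare_digest avoids timing side channels in the comparison.
    return hmac.compare_digest(sha256_digest(data), expected)

payload = b"update payload bytes"      # stand-in for a real artifact
published = sha256_digest(payload)     # shipped with the release

assert verify_artifact(payload, published)          # intact: proceed
assert not verify_artifact(b"tampered", published)  # mismatch: refuse to load
```

The crucial part is where the check runs: before execution, in every environment that consumes the artifact, not just in CI.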
- Evolve your negative testing: Continuously brainstorm with your team on potential failure modes of your code and write tests to trigger these failures. This approach helps in improving error and exception handling. Start with common or simple failure scenarios and progressively cover unlikely ones.
- Tip: Use chaos engineering tools like Chaos Monkey to simulate failures and test your system’s resilience.
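Chaos tooling aside, negative testing can start very small. A sketch with plain asserts, where the parser and its failure modes are hypothetical:

```python
def parse_config_entry(raw: str) -> dict:
    """Parse a 'key=value' entry, failing loudly instead of guessing."""
    if not raw.strip():
        raise ValueError("empty config entry")
    if "=" not in raw:
        raise ValueError(f"malformed entry (no '='): {raw!r}")
    key, _, value = raw.partition("=")
    return {key.strip(): value.strip()}

def raises_value_error(func, *args) -> bool:
    """Negative-test helper: did the call fail the way we expect?"""
    try:
        func(*args)
    except ValueError:
        return True
    return False

# Simple failure modes first; extend the list as the team brainstorms new ones.
assert raises_value_error(parse_config_entry, "")
assert raises_value_error(parse_config_entry, "no-separator")
assert parse_config_entry("verbosity = high") == {"verbosity": "high"}
```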
- Enhance logging capabilities: Structured and informative logs are vital for debugging and testing. Effective logging makes it easier to identify the cause of failures. Implement a dynamic logging system that can be tuned to different verbosity levels based on needs and conditions.
- Tip: Libraries like Log4j for Java, Serilog for .NET, or Winston for Node.js provide robust logging capabilities.
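With Python's standard `logging` module, runtime-tunable verbosity takes only a few lines; the logger name and messages are illustrative:

```python
import logging

# One handler with a structured-ish format; verbosity is tunable at runtime.
logging.basicConfig(format="%(asctime)s %(levelname)s %(name)s: %(message)s")
log = logging.getLogger("sensor.updater")
log.setLevel(logging.INFO)

def set_verbosity(level_name: str) -> None:
    """Flip log detail at runtime -- e.g. DEBUG during an incident,
    back to INFO once things are stable -- without redeploying."""
    log.setLevel(getattr(logging, level_name.upper()))

log.debug("suppressed at INFO level")
set_verbosity("debug")
log.debug("now visible: update file parsed, proceeding")
```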
- Improve integration testing: When your code has external dependencies (especially from third parties), adopt a defensive test strategy. Force failures to ensure your code handles them gracefully and recovers effectively.
- Tip: Use integration testing frameworks like Postman for API testing or Testcontainers for running tests with real dependencies in isolated environments.
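A cheap way to force such failures is to stub the dependency and make it blow up. Here is a sketch using Python's standard `unittest.mock`; the URL and the manifest function are made up:

```python
from unittest import mock
import urllib.request

def fetch_update_manifest(url: str) -> dict:
    """Call an external service, degrading gracefully when it is down."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return {"status": "ok", "body": resp.read()}
    except OSError as exc:  # covers network errors, timeouts, DNS failures
        return {"status": "degraded", "error": str(exc)}

# Force the dependency to fail and check that we recover instead of crashing.
with mock.patch("urllib.request.urlopen",
                side_effect=OSError("connection refused")):
    result = fetch_update_manifest("https://updates.example.com/manifest")

assert result["status"] == "degraded"
```

The test never touches the network, so it stays fast and can simulate outages that are hard to reproduce against a live dependency.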
- Never fail silently: Ensure all conditional statements (e.g., if, else) and exception handling (e.g., try/catch) contain code, even if it's only for logging. Silent failures can hide critical issues.
- Tip: Implement comprehensive error handling strategies and always log unexpected conditions.
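The difference between a silent failure and a surfaced one can be a single line of logging; both handlers below are deliberately minimal sketches:

```python
import logging

log = logging.getLogger("updater")

def apply_update_silent(update) -> None:
    try:
        update.install()
    except Exception:
        pass  # ANTI-PATTERN: the failure vanishes without a trace

def apply_update_loud(update) -> bool:
    try:
        update.install()
        return True
    except Exception:
        # We may still choose to continue, but the failure is on record.
        log.exception("update installation failed; keeping previous version")
        return False

class BrokenUpdate:
    def install(self):
        raise RuntimeError("invalid channel file")

assert apply_update_loud(BrokenUpdate()) is False  # surfaced, not swallowed
```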
- Validate and sanitize inputs: If your code loads files or receives user input, enhance input validation and sanitization. Whenever possible, handle these actions in separate processes or threads to prevent exceptions from crashing the main process.
- Tip: Use libraries like OWASP ESAPI for Java, or even plain regular expressions, for input validation.
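Both halves of this advice fit in a short sketch: a whitelist regex rejects bad input up front, and risky parsing runs in a child process so a crash cannot take down the parent. The version format and the parsing job are hypothetical:

```python
import re
import subprocess
import sys

VERSION_RE = re.compile(r"\d{1,3}\.\d{1,3}\.\d{1,6}")  # illustrative whitelist

def sanitize_version(raw: str) -> str:
    """Reject untrusted input before it reaches any parsing logic."""
    candidate = raw.strip()
    if not VERSION_RE.fullmatch(candidate):
        raise ValueError(f"rejected input: {candidate!r}")
    return candidate

def parse_in_child(payload: str) -> bool:
    """Run risky parsing in a separate process: if it crashes, only the
    child dies and the parent just sees a nonzero exit code."""
    proc = subprocess.run(
        [sys.executable, "-c", "import sys; int(sys.argv[1])", payload],
        capture_output=True,
    )
    return proc.returncode == 0

assert sanitize_version(" 7.11.18110 ") == "7.11.18110"
assert parse_in_child("123")
assert not parse_in_child("not-a-number")  # child failed; parent unharmed
```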
- Follow the least privilege principle: Run your code with only the necessary privileges. Identify and test the required permissions instead of running everything as "root" or "administrator."
- Tip: Use role-based access control (RBAC) and the principle of least privilege (PoLP) to minimize security risks.
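RBAC applies at the platform level, but the same mindset extends to every resource your code touches. A small POSIX file-permission example, with an illustrative file name:

```python
import os
import stat
import tempfile

def write_private_config(path: str, data: str) -> None:
    """Create a file with owner-only permissions (0o600) -- exactly the
    access the task needs, and nothing more."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "w") as f:
        f.write(data)

path = os.path.join(tempfile.mkdtemp(), "sensor.cfg")
write_private_config(path, "enabled=true\n")
mode = stat.S_IMODE(os.stat(path).st_mode)
assert mode == 0o600  # group and other users have no access at all
```

Granting permissions explicitly, rather than inheriting broad defaults, forces you to discover and document what your code actually needs.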
- Leverage testware in CI: Integrate automated tests and quality checks that make sense into your CI/CD pipeline. Regularly review and update your test configuration to include new or updated testware.
- Tip: Tools like Jenkins, GitHub Actions, or GitLab CI/CD can automate your testing and deployment processes effectively.
- Consider alternative release strategies: Techniques like A/B testing, targeted rollouts, staggered rollouts, or feature flagging are typically less risky than full-rollout releases. Ensure you have system health monitoring in place, especially for highly critical deployments.
- Tip: Use feature flagging tools like LaunchDarkly or Unleash to control feature releases and mitigate risks.
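Under the hood, a staggered rollout can be as simple as deterministic bucketing: hash each host (or user) together with the feature name and compare against the current percentage. A sketch, with made-up feature and host names:

```python
import hashlib

def in_rollout(unit_id: str, feature: str, percent: int) -> bool:
    """Deterministic percentage rollout: a given unit always gets the same
    answer, and stays enrolled as the percentage ramps upward."""
    digest = hashlib.sha256(f"{feature}:{unit_id}".encode()).digest()
    bucket = int.from_bytes(digest[:2], "big") % 100  # stable bucket, 0..99
    return bucket < percent

# Ramp a risky change 1% -> 10% -> 100%, checking health metrics between steps.
canary = [h for h in (f"host-{i}" for i in range(1000))
          if in_rollout(h, "new-content-parser", 10)]
# Roughly a tenth of the fleet gets the change first; everyone else stays
# on the known-good version until monitoring confirms the canary is healthy.
```

Dedicated tools add audit trails, kill switches, and targeting rules on top, but the core mechanism is this small.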
- Carefully consider release dates and times: For critical systems, avoid releasing updates on Friday evenings or just before major holidays. Even with thorough testing and a track record of successful deployments, schedule releases so there is time to detect and respond to problems while the team is available.
- Tip: Schedule releases during periods of low activity and ensure support staff availability to handle potential issues.
Conclusion
In software development, human error can occur at every step of the way. Continuously reviewing and enhancing your development practices can have a significant impact on mitigating such errors.
At Softtek, we have developed a fully customizable software quality management approach that adapts to your organizational context, product needs, tech stack, and team dynamics. This helps you adopt industry best practices, adhere to regulations, and maintain lean, observable, and traceable testing processes.
Visit our Nearshore QA Testing page to discover how global nearshore services, expert testers, and AI combine to elevate quality quickly and at enterprise scale.