> We make software deployments constantly across the network and have automated systems to run test suites and a procedure for deploying progressively to prevent incidents.<p>Good.<p>> Unfortunately, these WAF rules were deployed globally in one go and caused today’s outage.<p>Wow. This seems like a very immature operational stance.<p>Any deployment of any kind should be subject to minimum deployment safety, that they claim they have.<p>> At 1402 UTC we understood what was happening and decided to issue a ‘global kill’ on the WAF Managed Rulesets, which instantly dropped CPU back to normal and restored traffic. That occurred at 1409 UTC.<p>Many large companies would have had automatic roll-back of this kind of change in less time than it took CloudFlare to (apparently) have humans decide to roll-back, and possibly before a single (usually not global) deployment had actually completed on all hosts/instances.<p>However, what is more concerning is that it seems you shouldn't <i>rely</i> on CloudFlare's "WAF Managed Rulesets" at all, since they seem to be willing to turn it off instead of correctly rolling back a bad deployment, which they only did > 43 minutes later:<p>> We then went on to review the offending pull request, roll back the specific rules, test the change to ensure that we were 100% certain that we had the correct fix, and re-enabled the WAF Managed Rulesets at 1452 UTC.<p>How were they not able to trivially roll back to the previous deployment?