Great information on the outage. Looks like another side effect of version 2 syndrome [1].<p>> <i>Today, in an effort to reclaim some technical debt, we deployed new code that introduced Gatebot to Provision API.</i><p>> <i>What we did not account for, and what Provision API didn’t know about, was that 1.1.1.0/24 and 1.0.0.0/24 are special IP ranges. Frankly speaking, almost every IP range is "special" for one reason or another, since our IP configuration is rather complex. But our recursive DNS resolver ranges are even more special: they are relatively new, and we're using them in a very unique way. Our hardcoded list of Cloudflare addresses contained a manual exception specifically for these ranges.</i><p>> <i>As you might be able to guess by now, we didn't implement this manual exception while we were doing the integration work. Remember, the whole idea of the fix was to remove the hardcoded gotchas!</i><p>When porting legacy code, it is important not only to understand the edge cases and technical debt built up over time, but also to test more heavily in production. You never know whether you caught them all: someone smart may have built them long ago, and there may be unknown hacks that became cornerstones of the system, for better or worse.<p>Phased and alpha/beta rollouts, done almost like A/B testing, are good for replacement systems. Version 2 systems can also add new attack vectors or other single points of failure that aren't as well known as the legacy problems; the Provision API seems like a candidate for that.<p>Over time the Version 2 system will be hardened, just before it reaches EOL and is replaced again to fix all the new problems that have arisen. Version 2s do innovate, but they also trade old issues and pain points for new, unknown problems.<p>[1] <a href="https://en.wikipedia.org/wiki/Second-system_effect" rel="nofollow">https://en.wikipedia.org/wiki/Second-system_effect</a>