Does anyone else see the missing piece to this post mortem? An infinite loop made its way onto a majority(? all?) of production servers, and the immediate response is more or less 'we shouldn't have deployed to as many customers, failure should have only happened to a small subset'?<p>I agree that improvements made to their deployment tooling were good and necessary, take the human temptation to skip steps out of the equation.<p>But this exemplifies a <i>major</i> problem our industry suffers from, in that it just taken as a given that critical errors will sometimes make their way into production servers and the best we can do is reduce the impact.<p>I find this absolutely unacceptable. How about we short circuit the process and identify ways to stop that from happening? Were there enough code reviews? Did automated testing fail here? Yes I'm familiar with the halting problem and limitations of formal verification on turing complete languages, but I don't believe it's an excuse.<p>This is tantamount to saying "yeah sometimes our airplanes crash, so from now on we'll just make sure we have less passengers ride in the newer models".