In the face of so many outages from big companies, I wonder how Visa/MasterCard stays so resilient.

Is it because they're past the curve and don't make "any" changes to their system, whereas other companies are still maturing?
[2019-07-10 20:13 UTC] During our investigation into the root cause of the first event, we identified a code path likely causing the bug in a new minor version of the database’s election protocol.
[2019-07-10 20:42 UTC] We rolled back to a previous minor version of the election protocol and monitored the rollout.

There's a 20-minute gap between the investigation and the rollback. Why did they roll back if the service was back to normal? How could they decide on and document the change within 20 minutes? Are they using CMs to document changes in production? Were there enough engineers involved in the decision? Clearly not all variables were considered.

To me, this demonstrates poor Operational Excellence values. Your first goal is to mitigate the problem. Then you need to analyze, understand, and document the root cause. Rolling back was a poor decision, imo.
"[Four days prior to the incident] Two nodes became stalled for yet-to-be-determined reasons."<p>How did they not catch this? It's super surprising to me that they wouldn't have monitors for this.
So the article identifies a software bug and a software/config bug as the root cause. That sounds a bit shallow for such a high-visibility case - I was expecting something like the 5 Whys method (https://en.wikipedia.org/wiki/5_Whys), with subplots on why the bugs were not caught in testing. By the way, I only clicked on it because I was hoping it would be an occasion to use the methods from http://bayes.cs.ucla.edu/WHY/ - alas, no - it was too shallow for that.
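For readers unfamiliar with the method, here is a rough Python sketch of the branching "why" chain the commenter seems to have expected. The first two links paraphrase the timeline entries quoted in this thread; the branch about testing is explicitly marked hypothetical because the report doesn't answer it.

```python
# Rough sketch of a 5-Whys-style chain for this incident. The first two
# links paraphrase the public report; anything marked "hypothetical" is
# invented only to show how the branching ("subplots") would look.
five_whys = {
    "why": "Why did the API report elevated error rates?",
    "because": "The database cluster degraded after two nodes stalled.",
    "next": {
        "why": "Why did the degradation persist instead of failing over cleanly?",
        "because": "A bug in a new minor version of the election protocol.",
        "next": {
            "why": "Why wasn't the bug caught before production? (unanswered in the report)",
            "because": "hypothetical: tests did not exercise clusters with stalled nodes.",
            "next": None,
        },
    },
}

def print_whys(node, depth=0):
    """Print each why/because pair, indenting one level per 'why'."""
    while node is not None:
        indent = "  " * depth
        print(f"{indent}{node['why']}")
        print(f"{indent}-> {node['because']}")
        node, depth = node["next"], depth + 1

print_whys(five_whys)
```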
As I mentioned earlier:

"human error often, configuration changes often, new changes often."
<a href="https://news.ycombinator.com/item?id=20406116" rel="nofollow">https://news.ycombinator.com/item?id=20406116</a>
This reads like the marketing/PR teams wrote much of it. Compare to the Cloudflare post-mortem from today: https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/
Is this Stripe's first public RCA? Looking through their tweets, there do not appear to be other RCAs for the same "elevated error rates". It seems hard to conclude much from one RCA.
Since both companies' root cause analyses are currently trending on HN, it's pretty apparent that Stripe's engineering culture has a long way to go to catch up with Cloudflare's.
"We identified that our rolled-back election protocol interacted poorly with a recently-introduced configuration setting to trigger the second period of degradation."<p>Damn what a mess. Sounds like y'all are rolling out way to many changes too quickly with little to no time for integration testing.<p>It's a somewhat amateur move to assume you can just arbitrarily rollback without consequence, without testing etc.<p>One solution I don't see mentioned, don't upgrade to minor versions ever. And create a dependency matrix so if you do rollback, you rollback all the other things that depend on the thing you're rolling back as well.