
Root cause analysis: significantly elevated error rates on 2019‑07‑10

203 points by gr2020 almost 6 years ago

11 comments

vjagrawal1984 almost 6 years ago
In the face of so many outages from big companies, I wonder how Visa/MasterCard is so resilient.

Is it because they are over the curve and don't make "any" changes to their system, as opposed to other companies that are still maturing?
ssalazars almost 6 years ago
"[2019-07-10 20:13 UTC] During our investigation into the root cause of the first event, we identified a code path likely causing the bug in a new minor version of the database's election protocol. [2019-07-10 20:42 UTC] We rolled back to a previous minor version of the election protocol and monitored the rollout."

There's a 20-minute gap between investigation and "rollback". Why did they roll back if the service was back to normal? How could they decide on, and document, the change within 20 minutes? Are they using CMs to document changes in production? Were there enough engineers involved in the decision? Clearly, not all variables were considered.

To me, this demonstrates poor Operational Excellence values. Your first goal is to mitigate the problem. Then you need to analyze, understand, and document the root cause. Rolling back was a poor decision, imo.
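A minimal sketch of what "rolled back ... and monitored the rollout" could look like as an automated gate, assuming a metrics source that reports the live error rate (the threshold, watch window, and function below are hypothetical):

    import time

    ERROR_RATE_THRESHOLD = 0.01  # hypothetical: above 1% errors is unhealthy
    WATCH_WINDOW_S = 20 * 60     # roughly the 20-minute gap discussed above

    def current_error_rate() -> float:
        # Hypothetical metrics query; a real system would ask Prometheus,
        # Datadog, or similar for the API error rate.
        return 0.0  # stub

    def monitor_rollback() -> None:
        """Watch error rates after a rollback and fail loudly on a spike."""
        deadline = time.time() + WATCH_WINDOW_S
        while time.time() < deadline:
            rate = current_error_rate()
            if rate > ERROR_RATE_THRESHOLD:
                raise RuntimeError(f"rollback unhealthy: error rate {rate:.2%}")
            time.sleep(30)  # sample every 30 seconds
        print("rollback held below threshold for the full watch window")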
laCour almost 6 years ago
"[Four days prior to the incident] Two nodes became stalled for yet-to-be-determined reasons."

How did they not catch this? It's super surprising to me that they wouldn't have monitors for this.
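A minimal sketch of the kind of stalled-node check such a monitor could run, assuming each replica exposes its last heartbeat over HTTP (the /health endpoint, hosts, and threshold below are hypothetical):

    import time
    import requests

    STALL_THRESHOLD_S = 60  # hypothetical: no heartbeat for 60s means stalled
    REPLICAS = ["db-node-1:8080", "db-node-2:8080", "db-node-3:8080"]

    def find_stalled(replicas: list[str]) -> list[tuple[str, float]]:
        """Return (host, heartbeat age) for every replica past the threshold."""
        stalled = []
        for host in replicas:
            # Hypothetical endpoint returning {"last_heartbeat": <unix ts>}
            resp = requests.get(f"http://{host}/health", timeout=5)
            age = time.time() - resp.json()["last_heartbeat"]
            if age > STALL_THRESHOLD_S:
                stalled.append((host, age))
        return stalled

    for host, age in find_stalled(REPLICAS):
        print(f"ALERT: {host} silent for {age:.0f}s")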
zby almost 6 years ago
So the article identifies a software bug and a software/config bug as the root cause. That sounds a bit shallow for such a high-visibility case - I was expecting something like the https://en.wikipedia.org/wiki/5_Whys method, with subplots on why the bugs were not caught in testing. By the way, I only clicked on it because I was hoping it would be an occasion to use the methods from http://bayes.cs.ucla.edu/WHY/ - alas no, it was too shallow for that.
gr2020 almost 6 years ago
Anybody know what database they’re using?
segmondy almost 6 years ago
As I mentioned earlier: "human error often, configuration changes often, new changes often." https://news.ycombinator.com/item?id=20406116
chance_state almost 6 years ago
This reads like the marketing/PR teams wrote much of it. Compare to the Cloudflare post-mortem from today: https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/
mual almost 6 years ago
Is this Stripe's first public RCA? Looking through their tweets, there do not appear to be other RCAs for the same "elevated error rates". It seems hard to conclude much from one RCA.
jacquesm almost 6 years ago
Why don't they call "significantly elevated error rates" an "outage" instead?
luminati almost 6 years ago
Since both companies' root cause analyses are currently trending on HN, it's pretty apparent that Stripe's engineering culture has a long way to go to catch up with Cloudflare's.
debt almost 6 years ago
"We identified that our rolled-back election protocol interacted poorly with a recently-introduced configuration setting to trigger the second period of degradation."

Damn, what a mess. Sounds like y'all are rolling out way too many changes too quickly, with little to no time for integration testing.

It's a somewhat amateur move to assume you can just arbitrarily roll back without consequence, without testing, etc.

One solution I don't see mentioned: don't ever upgrade to minor versions. And create a dependency matrix so that if you do roll back, you also roll back all the other things that depend on the thing you're rolling back.
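A minimal sketch of that dependency-matrix idea, assuming a hand-maintained map from each component to the components that depend on it (all names here are hypothetical):

    # Hypothetical dependency matrix: component -> components that depend on it
    DEPENDENTS = {
        "election-protocol": ["shard-router", "failover-daemon"],
        "shard-router": ["query-frontend"],
        "failover-daemon": [],
        "query-frontend": [],
    }

    def rollback_set(component: str, deps=DEPENDENTS) -> set[str]:
        """Return every component that must be rolled back together with
        `component`, walking the dependents transitively."""
        to_roll = set()
        stack = [component]
        while stack:
            current = stack.pop()
            if current not in to_roll:
                to_roll.add(current)
                stack.extend(deps.get(current, []))
        return to_roll

    # Rolling back the election protocol pulls in everything built against it:
    # {'election-protocol', 'shard-router', 'failover-daemon', 'query-frontend'}
    print(rollback_set("election-protocol"))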