In the face of so many outages from big companies, I wonder how Visa/MasterCard stays so resilient.

Is it because they're past the curve and don't make "any" changes to their system, whereas other companies are still maturing?
[2019-07-10 20:13 UTC] During our investigation into the root cause of the first event, we identified a code path likely causing the bug in a new minor version of the database’s election protocol.
[2019-07-10 20:42 UTC] We rolled back to a previous minor version of the election protocol and monitored the rollout.

There's a 20-minute gap between the investigation and the rollback. Why did they roll back if the service was back to normal? How could they decide on and document the change within 20 minutes? Are they using CMs to document changes in production? Were there enough engineers involved in the decision? Clearly not all variables were considered.

To me, this demonstrates poor Operational Excellence values. Your first goal is to mitigate the problem. Then you need to analyze, understand, and document the root cause. Rolling back was a poor decision, imo.
"[Four days prior to the incident] Two nodes became stalled for yet-to-be-determined reasons."<p>How did they not catch this? It's super surprising to me that they wouldn't have monitors for this.
So the article identifies a software bug and a software/config bug as the root cause. That sounds a bit shallow for such a high-visibility case - I was expecting something like the 5 Whys method (https://en.wikipedia.org/wiki/5_Whys), with subplots on why the bugs were not caught in testing. By the way, I only clicked on it because I was hoping it would be an occasion to use the methods from http://bayes.cs.ucla.edu/WHY/ - alas, no - it was too shallow for that.
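For readers unfamiliar with the method, here is a rough Python sketch of the branching "why" chain the commenter seems to have expected. The first two links paraphrase the timeline entries quoted in this thread; the branch about testing is explicitly marked hypothetical because the report doesn't answer it.

```python
# Rough sketch of a 5-Whys-style chain for this incident. The first two
# links paraphrase the public report; anything marked "hypothetical" is
# invented only to show how the branching ("subplots") would look.
five_whys = {
    "why": "Why did the API report elevated error rates?",
    "because": "The database cluster degraded after two nodes stalled.",
    "next": {
        "why": "Why did the degradation persist instead of failing over cleanly?",
        "because": "A bug in a new minor version of the election protocol.",
        "next": {
            "why": "Why wasn't the bug caught before production? (unanswered in the report)",
            "because": "hypothetical: tests did not exercise clusters with stalled nodes.",
            "next": None,
        },
    },
}

def print_whys(node, depth=0):
    """Print each why/because pair, indenting one level per 'why'."""
    while node is not None:
        indent = "  " * depth
        print(f"{indent}{node['why']}")
        print(f"{indent}-> {node['because']}")
        node, depth = node["next"], depth + 1

print_whys(five_whys)
```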
As I mentioned earlier:

"human error often, configuration changes often, new changes often."
<a href="https://news.ycombinator.com/item?id=20406116" rel="nofollow">https://news.ycombinator.com/item?id=20406116</a>
This reads like the marketing/PR teams wrote much of it. Compare to the Cloudflare post-mortem from today: https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/
Is this Stripe's first public RCA? Looking through their tweets, there do not appear to be other RCAs for the same "elevated error rates". It seems hard to conclude much from one RCA.
Since both companies' root cause analyses are currently trending on HN, it's pretty apparent that Stripe's engineering culture has a long way to go to catch up with Cloudflare's.
"We identified that our rolled-back election protocol interacted poorly with a recently-introduced configuration setting to trigger the second period of degradation."<p>Damn what a mess. Sounds like y'all are rolling out way to many changes too quickly with little to no time for integration testing.<p>It's a somewhat amateur move to assume you can just arbitrarily rollback without consequence, without testing etc.<p>One solution I don't see mentioned, don't upgrade to minor versions ever. And create a dependency matrix so if you do rollback, you rollback all the other things that depend on the thing you're rolling back as well.