Billing Incident Post-Mortem: Breakdown, Analysis and Root Cause

77 points by RobSpectre almost 12 years ago

10 comments

pilif almost 12 years ago
> 2:42 AM PDT / 7:42 UTC: Our on-call engineers restarted the redis-master to address the high load.

Over the years I've learned that restarting something that's under high load is not the correct solution to the problem.

By restarting the thing under load, you will always further increase the load when the thing you've restarted comes back up. Recovery will take longer and things can fail even more badly.

Of course there's the slim possibility that the thing under load has actually crashed or entered some infinite loop, but over the years I've learned that it's far more likely to be my fault than their fault; the abnormal load is far more likely caused by a problem at my end than by a bug in the software that's misbehaving.

I know that just restarting is tempting, but I've seen so many times that spending the extra effort to analyze the root cause is usually worth it.
Comment #6094556 not loaded
Comment #6094834 not loaded
wpietri almost 12 years ago
Good for them for being open about this. Solid incident reports like this make me trust a vendor so much more. Not only do I have a good idea how they will handle a big public failure next time, but it tells me a lot about how they're handling the private issues, and therefore how robust their system is likely to be.

One minor suggestion: the root cause of an incident is never a technical problem. The technology is made by people, so you should always come back to the human systems that make a company go. Not so you can blame people. It's just the opposite: you want to find better ways to support people in the future.
Comment #6094572 not loaded
ajtaylor almost 12 years ago
I've been looking forward to reading this. It never fails to amaze me how these sorts of incidents are caused by a cascade of small, unrelated problems, any one of which on its own would likely not have caused the end problem.

The post-mortem has plenty of detail without being wishy-washy in any way. Just the facts, ma'am, ending with a sincere apology and steps to prevent a recurrence. Well done!
Comment #6094581 not loaded
Comment #6094671 not loaded
CptCodeMonkey almost 12 years ago
Taking a page from something I learned from MySQL replication tricks, have you thought about trying a master -> inner slaves -> fan-out slaves topology, where there's one inner slave per datacenter/region?

I know at least with 1.2.6 a slave can be a slave of a slave, but I never measured the latency of a write at the master, through the inner peers, out to the fan-out slaves. Admittedly a more complicated topology, but it would circumvent the stampedes against a master instance and also make it easier to spin up larger numbers of slaves without wiping out the entire platform.
Comment #6094640 not loaded
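A minimal sketch of the chained topology CptCodeMonkey describes, assuming redis-py and made-up hostnames: one inner replica per region follows the master, and the fan-out replicas follow their regional inner replica, so a master restart doesn't trigger a full-resync stampede. This is an illustration, not Twilio's setup.

```python
# Sketch of a chained Redis replication topology: one "inner" replica per
# region replicates from the master, and the fan-out replicas replicate
# from their regional inner replica instead of hammering the master.
# Hostnames and ports are hypothetical; requires the redis-py client.
import redis

MASTER = ("redis-master.example.internal", 6379)

# One inner replica per datacenter/region (hypothetical names).
INNER = {
    "us-east": ("redis-inner-use1.example.internal", 6379),
    "eu-west": ("redis-inner-euw1.example.internal", 6379),
}

# Fan-out replicas, grouped by the region whose inner replica they follow.
FANOUT = {
    "us-east": [("redis-ro-use1-a.example.internal", 6379),
                ("redis-ro-use1-b.example.internal", 6379)],
    "eu-west": [("redis-ro-euw1-a.example.internal", 6379)],
}

def wire_topology():
    for region, (inner_host, inner_port) in INNER.items():
        # The inner replica follows the master directly.
        redis.Redis(host=inner_host, port=inner_port).slaveof(*MASTER)
        # Fan-out replicas follow their regional inner replica, so a full
        # resync after a master restart does not stampede the master.
        for host, port in FANOUT[region]:
            redis.Redis(host=host, port=port).slaveof(inner_host, inner_port)

if __name__ == "__main__":
    wire_topology()
```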
pkteison almost 12 years ago
It sounds like, by deciding to make a charge based on what amounts to a temporary cache of the balance, there has always been a race condition in the billing code? I would think it would have always been necessary to successfully update the balance before proceeding with the charge in order to avoid double charges.

I can see how this sort of situation is hard to predict and test for, but that means it needs to be designed and implemented very carefully. $0 -> charge card just isn't a reasonable approach at all.
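A toy illustration of the race pkteison is pointing at, with invented class names and amounts (not Twilio's actual code): charging off a possibly-stale cached balance lets two concurrent requests both see $0 and both charge, whereas atomically reserving the recharge against the authoritative balance first allows at most one charge.

```python
# Toy sketch: "charge based on a cached balance" vs. "update the
# authoritative balance first, then charge". charge_card() is a stand-in.
import threading

class RacyBilling:
    """Reads a cached balance, then charges: two concurrent requests can
    both see $0 and both trigger a recharge, i.e. a double charge."""
    def __init__(self, cached_balance=0.0):
        self.cached_balance = cached_balance

    def maybe_recharge(self, charge_card):
        if self.cached_balance <= 0:      # stale read of the cache
            charge_card(20.00)            # both racing requests reach this
            self.cached_balance += 20.00

class SaferBilling:
    """Atomically reserves the recharge against the authoritative balance
    before charging, so at most one of the racing requests charges."""
    def __init__(self, balance=0.0):
        self.balance = balance
        self._lock = threading.Lock()     # stands in for a DB transaction

    def maybe_recharge(self, charge_card):
        with self._lock:
            if self.balance > 0:
                return
            self.balance += 20.00         # balance update succeeds first
        charge_card(20.00)                # only one caller gets here
```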
statusgraph almost 12 years ago
Interesting to slave redis across datacenters. I wonder how well that works in practice.
Comment #6094614 not loaded
chill1 almost 12 years ago
I don't use Twilio, but this was an interesting write-up. After reading it, I can totally understand how important it is to have safeguards and redundancies baked into user balance information. I like how they even mentioned (although a bit vaguely) how they plan to implement those additional protections [1].

[1] "We are now introducing robust fail-safes, so that if billing balances don't exist or cannot be written, the system will not suspend accounts or charge credit cards. Finally, we will be updating the billing system to validate against our double-bookkeeping databases in real-time."
Comment #6094585 not loaded
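A rough sketch of the kind of fail-safe the quoted passage describes, assuming a hypothetical BillingStore interface rather than Twilio's actual API: if the balance is missing or unreadable, the system neither suspends the account nor charges the card.

```python
# Sketch of a "fail safe toward doing nothing" guard: if balance data is
# missing or unreadable, never suspend the account or charge the card.
# BillingStore and its read_balance() method are hypothetical stand-ins.
class BalanceUnavailable(Exception):
    """Raised when the balance store cannot be read."""

def billing_decision(store, account_id):
    try:
        balance = store.read_balance(account_id)  # may raise BalanceUnavailable
    except BalanceUnavailable:
        return "skip"        # fail-safe: no suspension, no charge
    if balance is None:
        return "skip"        # a missing balance is treated the same way
    if balance <= 0:
        return "recharge"    # only a confirmed non-positive balance charges
    return "ok"
```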
richardv almost 12 years ago
Maybe I missed this bit of information, but can someone explain:

They use a multi-datacenter master-slave Redis cluster? What's the relationship between the master and the slaves?

Are the slaves just for failover? Or are they for reads only?

How have they configured their writes and reads from their main application? I'm just curious how their application routes the reads/writes in such a multi-database setup. (Are they using DNS for a manual failover?)
Comment #6097897 not loaded
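One plausible answer to the routing question, sketched with hypothetical endpoints (this is a guess at a common pattern, not Twilio's actual configuration): writes always go to the master, while reads prefer the local replica and fall back to the master if the replica is unreachable.

```python
# Hypothetical read/write routing for a master plus regional replicas:
# all writes go to the master, reads prefer the nearby replica.
import redis

class RoutedRedis:
    def __init__(self, master_addr, local_replica_addr):
        self.master = redis.Redis(*master_addr)
        self.replica = redis.Redis(*local_replica_addr)

    def write(self, key, value):
        return self.master.set(key, value)   # writes only hit the master

    def read(self, key):
        try:
            return self.replica.get(key)     # prefer the local replica
        except redis.ConnectionError:
            return self.master.get(key)      # fall back to the master

# Example wiring with made-up addresses:
# store = RoutedRedis(("redis-master.example.internal", 6379),
#                     ("redis-replica-local.example.internal", 6379))
```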
ahk almost 12 years ago
This seems to only explain a single faulty recharge, due to the customer using the service with a zero balance and triggering the charge attempt. Why would there be multiple recharge attempts? Was the customer using the service multiple times in that period, triggering each of the recharge attempts, or was it the code retrying the transaction? If it was the code, why would it restart the transaction from the top instead of just the part that failed, the balance update?
Comment #6094552 not loaded
Comment #6094557 not loaded
Comment #6094559 not loaded
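A small sketch of the distinction being drawn here, with invented helper names: a retry loop that restarts "charge then update balance" from the top re-runs the card charge, while an idempotency key confines retries to the failed balance write.

```python
# Sketch: retrying the whole flow repeats the charge; an idempotency key
# limits retries to the failed step. charge_card() and update_balance()
# are hypothetical stand-ins.
def recharge_naive(charge_card, update_balance, amount, retries=3):
    for _ in range(retries):
        try:
            charge_card(amount)        # re-runs on every retry -> double charge
            update_balance(amount)     # if this write fails, the loop repeats
            return True
        except IOError:
            continue
    return False

def recharge_idempotent(charge_card, update_balance, amount, key, charged,
                        retries=3):
    # 'charged' is a set of idempotency keys that have already been billed.
    if key not in charged:
        charge_card(amount)            # at most one charge per key
        charged.add(key)
    for _ in range(retries):
        try:
            update_balance(amount)     # only the failed step is retried
            return True
        except IOError:
            continue
    return False
```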
Xorlev almost 12 years ago
The double-bookkeeping Twilio is using -- what's the downside to it? Is it only updated per-minute, with Redis doing the second-to-second usage and then rolled off into MySQL?

I'm curious how that's working, if you don't mind the question.
Comment #6094631 not loaded
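A guess at the sort of arrangement being asked about, not Twilio's actual design: Redis absorbs the second-to-second usage decrements, and a periodic job rolls the accumulated deltas into a durable SQL ledger that can later be used for reconciliation. Class and key names here are invented.

```python
# Hypothetical double-bookkeeping rollup: Redis takes the second-to-second
# usage hits, and a periodic job writes the accumulated deltas into a
# durable SQL ledger used for reconciliation.
import sqlite3
import time

class DoubleBookkeeper:
    def __init__(self, redis_client, db_path="ledger.db"):
        self.redis = redis_client
        self.db = sqlite3.connect(db_path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS ledger (account TEXT, delta REAL, ts REAL)")

    def record_usage(self, account, cost):
        # Hot path: only Redis is touched per call.
        self.redis.incrbyfloat(f"pending:{account}", -cost)

    def roll_up(self, accounts):
        # Periodic job (e.g. once a minute): move the accumulated delta for
        # each account into the SQL ledger and reset the Redis counter.
        for account in accounts:
            delta = self.redis.getset(f"pending:{account}", 0)
            if delta is not None:
                self.db.execute("INSERT INTO ledger VALUES (?, ?, ?)",
                                (account, float(delta), time.time()))
        self.db.commit()
```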