<i>> 2:42 AM PDT / 7:42 UTC: Our on-call engineers restarted the redis-master to address the high load.</i><p>Over the years I've learned that restarting something that's under high load is not the correct solution to the problem.<p>By restarting the thing under load, you only add to the load when the restarted thing comes back up. Recovery takes longer and things can fail even more badly.<p>Of course there's the slim possibility that the thing under load has actually crashed or entered some infinite loop, but in my experience the abnormal load is far more likely caused by a problem at my end than by a bug in the software that's misbehaving.<p>I know that just restarting is tempting, but I've seen so many times that spending the extra effort to analyze the root cause is usually worth it.
Good for them for being open about this. Solid incident reports like this make me trust a vendor so much more. Not only do I have a good idea how they will handle a big public failure next time, but it tells me a lot about how they're handling the private issues, and therefore how robust their system is likely to be.<p>One minor suggestion: the root cause of an incident is never a technical problem. The technology is made by people, so you should always come back to the human systems that make a company go. Not so you can blame people. It's just the opposite: you want to find better ways to support people in the future.
I've been looking forward to reading this. It never fails to amaze me how these sorts of incidents are caused by a cascade of small, unrelated problems, any one of which on its own would likely not have caused the end result.<p>The post-mortem has plenty of detail without being wishy-washy in any way. Just the facts, ma'am, ending with a sincere apology and steps to prevent a recurrence. Well done!
Taking a page from MySQL replication tricks, have you thought about trying a master -> inner slaves -> fan-out slaves topology, where there's one inner slave per datacenter/region?<p>I know that at least with 1.2.6 a slave can be a slave of a slave, but I never measured the latency of a write from the master, through the inner slaves, out to the fan-out slaves. Admittedly a more complicated topology, but it would circumvent the stampedes against the master instance and also make it easier to spin up larger numbers of slaves without wiping out the entire platform. Rough sketch below.
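<p>A rough sketch of what that chained topology might look like, driven from redis-py (hostnames are made up and this is just my guess at the shape, not anything from the post):

  import redis

  # The "inner" slave in each region replicates directly from the single master.
  inner = redis.Redis("redis-inner.us-east.example.internal", 6379)
  inner.slaveof("redis-master.example.internal", 6379)

  # The fan-out slaves in that region replicate from their inner slave,
  # so a resync stampede hits the inner slave instead of the master.
  for host in ["redis-leaf-1.us-east.example.internal",
               "redis-leaf-2.us-east.example.internal"]:
      leaf = redis.Redis(host, 6379)
      leaf.slaveof("redis-inner.us-east.example.internal", 6379)

Each write still has to hop through the inner slave before it reaches the leaves, which is exactly the latency I never got around to measuring.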
It sounds like, by making charges based on what amounts to a temporary cache of the balance, the billing code has always had a race condition? I would think it would always have been necessary to successfully update the balance before proceeding with the charge in order to avoid double charges.<p>I can see how this sort of situation is hard to predict and test for, but that means it needs to be designed and implemented very carefully. $0 -> charge card just isn't a reasonable approach at all.
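<p>Not Twilio's code, obviously, just a sketch of the ordering I mean: read and debit the authoritative balance atomically, refuse to act if it's missing, and only consider a card charge after the debit has actually committed.

  import redis

  r = redis.Redis()

  def debit_then_maybe_recharge(account_id, cost_cents):
      key = "balance:%s" % account_id
      with r.pipeline() as pipe:
          while True:
              try:
                  pipe.watch(key)
                  current = pipe.get(key)
                  if current is None:
                      # No balance record at all: refuse to act rather than
                      # treating it as $0 and kicking off a card charge.
                      raise RuntimeError("no balance for %s" % account_id)
                  pipe.multi()
                  pipe.decrby(key, cost_cents)
                  new_balance, = pipe.execute()
                  break
              except redis.WatchError:
                  continue  # balance changed under us; re-read and retry
      if new_balance <= 0:
          start_recharge(account_id)  # hypothetical helper that charges the card once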
I don't use Twilio, but this was an interesting write-up. After reading it, I can totally understand how important it is to have safeguards and redundancies built around user balance information. I like how they even mentioned (although a bit vaguely) how they plan to implement those additional protections [1].<p>[1] "We are now introducing robust fail-safes, so that if billing balances don’t exist or cannot be written, the system will not suspend accounts or charge credit cards. Finally, we will be updating the billing system to validate against our double-bookkeeping databases in real-time."
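<p>A minimal sketch of the kind of fail-safe that quote describes, as I read it (my own guess at the shape, not their implementation): a missing or unreadable balance is "unknown", and unknown never triggers a suspension or a charge.

  import redis

  r = redis.Redis()

  def enforce_balance(account_id):
      try:
          raw = r.get("balance:%s" % account_id)
      except redis.ConnectionError:
          return "skip"        # can't read the balance -> don't suspend, don't charge
      if raw is None:
          return "skip"        # balance record missing -> treat as unknown, not $0
      if int(raw) <= 0:
          return "suspend_or_recharge"   # only act on a balance we actually have
      return "ok"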
Maybe I missed this bit of information, but can someone explain...<p>They use a multi-datacenter master-slave Redis cluster? What's the relationship between the master and the slaves?<p>Are the slaves just for failover, or are they read-only replicas?<p>How have they configured the writes and reads from their main application? I'm just curious how their application routes reads/writes in such a multi-database setup. (Are they using DNS for manual failover?)
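<p>The post doesn't say, but one common pattern for this kind of setup (purely my assumption, with made-up hostnames) is to send all writes to the single master and serve reads from a slave in the local datacenter, accepting slightly stale reads:

  import redis

  master     = redis.Redis("redis-master.example.internal", 6379)
  local_read = redis.Redis("redis-slave.local.example.internal", 6379)

  def record_usage(account_id, cost_cents):
      master.decrby("balance:%s" % account_id, cost_cents)   # all writes hit the master

  def get_balance(account_id):
      raw = local_read.get("balance:%s" % account_id)        # reads may lag replication
      return int(raw) if raw is not None else None

That's what I'd like confirmed, along with whether failover to a slave is manual (e.g. a DNS flip) or automated.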
This seems to only explain a single faulty recharge, due to the customer using the service with a zero balance and triggering the charge attempt. Why would there be multiple recharge attempts? Was the customer using the service multiple times in that period, triggering each of the recharge attempts, or was it the code re-trying the transaction? If it was the code, why would it restart the transaction from the top instead of just the part that failed - the balance update?
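<p>For what it's worth, this is the separation I'd expect (my own framing with a hypothetical charge_card callback, not Twilio's code): charge the card once, then retry only the balance write, so a failed write can never trigger a second charge.

  import time
  import redis

  r = redis.Redis()

  def recharge(account_id, amount_cents, charge_card):
      charge_id = charge_card(account_id, amount_cents)   # run exactly once, never retried
      for attempt in range(5):
          try:
              r.incrby("balance:%s" % account_id, amount_cents)
              return charge_id
          except redis.ConnectionError:
              time.sleep(2 ** attempt)   # back off and retry only the balance write
      raise RuntimeError("charged %s but could not record it" % charge_id)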
The double-bookkeeping Twilio is using -- what's the downside to it? Is it only updated per-minute with Redis doing the second-to-second usage and then rolled off into MySQL?<p>I'm curious how that's working, if you don't mind the question.
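<p>If it works the way you describe (which is just a guess on my part as well), the mechanism could be as simple as a periodic job snapshotting the live Redis counters into a durable SQL ledger, something like this, with sqlite standing in for MySQL to keep the sketch self-contained:

  import time
  import redis
  import sqlite3   # stand-in for MySQL

  r = redis.Redis()
  db = sqlite3.connect("ledger.db")
  db.execute("CREATE TABLE IF NOT EXISTS ledger"
             " (account_id TEXT, balance INTEGER, recorded_at INTEGER)")

  def roll_off(account_ids):
      now = int(time.time())
      for account_id in account_ids:
          raw = r.get("balance:%s" % account_id)
          if raw is None:
              continue                      # nothing to record for this account yet
          db.execute("INSERT INTO ledger VALUES (?, ?, ?)",
                     (account_id, int(raw), now))
      db.commit()

The obvious downside would be that the ledger is only as fresh as the last roll-off, which is presumably why they now want to validate against it in real time.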