TechEcho

8 comments

zackmorris, almost 12 years ago

What caught my attention was where Twilio said the redis-slaves were timing out to the redis-master:

http://www.twilio.com/blog/2013/07/billing-incident-post-mortem.html

I think timeouts should be abolished for the vast majority of software today.

The usual reasoning goes something like this: for a TCP connection, if you don't hear from the server for some period of time, you can assume that something is "wrong" and drop the connection. The fallacy is, the TCP connection is not really important to the shared state of two devices. From the very beginning (I'm talking 1970s!), devices should have been using tokens to identify one another, regardless of communication status. The tokens could be saved in nonvolatile memory on servers so that jobs could always continue where they left off.

Instead we have a whole slew of nondeterministic pathological cases -exactly- like the one that hit Twilio. If you take on the burden of timeouts, you end up with dozens of places in your code (even more, potentially) where you just don't know what to do if you lose communication.

If you don't take on the burden of timeouts, then you can just track each connection and all it costs you is storage space, which is practically free today and getting cheaper every year. With credentials from the client, you don't even have to worry about duplicate connections. You can write your client-server code deterministically and stick to the logic, and easily stress test failure modes.
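One way to picture the token idea in this comment is the rough sketch below. It is not taken from Twilio or Redis: the `JobStore` class, its method names, and the JSON file layout are all hypothetical, and a single-process server is assumed. Progress is keyed by a client token and written to disk, so a client that reconnects (or a server that restarts) can pick up where the job left off.

```python
import json
import os


class JobStore:
    """Hypothetical store: job progress keyed by client token, kept on disk."""

    def __init__(self, path):
        self.path = path
        self.progress = {}
        # Reload whatever was saved before a restart or a dropped connection.
        if os.path.exists(path):
            with open(path) as f:
                self.progress = json.load(f)

    def advance(self, token, items_done):
        """Record how far the job for this token has got, durably."""
        self.progress[token] = items_done
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(self.progress, f)
        # Atomic rename so a crash never leaves a half-written file behind.
        os.replace(tmp, self.path)

    def resume_point(self, token):
        """Where should the client holding this token continue from?"""
        return self.progress.get(token, 0)
```

The connection itself never appears in the stored state; only the token does, which is the point the comment is making.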

aidos, almost 12 years ago

Very clear and thoughtful post from antirez, as ever.

It's worth reading his post on how persistence works in Redis (and other dbs). It's very interesting and gives great insights as to what goes on down in dbs to try to keep our data safe - particularly for those of us who don't ever interact with that layer directly.

http://oldblog.antirez.com/post/redis-persistence-demystified.html

eblume, almost 12 years ago

It's good to see Twilio post this! That being said - yeah, I really am concerned that Twilio is using an ephemeral database to store such important data. Why not simply use Postgres? Is Twilio really making so many transactions per second that Postgres won't scale?

mountaineer, almost 12 years ago

Here's the Twilio post-mortem thread on HN: https://news.ycombinator.com/item?id=6093954

mbillie1, almost 12 years ago

I'm curious if you're using anything other than redis-cli to set the master/slave relationships, and if you have any failover mechanism. I've used corosync/pacemaker for a high-availability redis cluster, but without an awful lot of confidence (we likely misconfigured it, to be fair).

Just "slaveof <masterip>" and other redis-cli commands? Or are you using any automated process?

Or has anyone else got a great redis failover/HA solution that they'd care to share?

(I apologize for this having nothing to do with Twilio; I'm just curious)

mountaineer, almost 12 years ago

Twilio definitely uses ec2; it's been an oft-highlighted choice in many presentations and posts over the years.

- http://www.slideshare.net/twilio/twilio-voice-applications-with-amazon-aws-s3-and-ec2-presentation
- http://www.twilio.com/engineering/2011/04/22/why-twilio-wasnt-affected-by-todays-aws-issues/

Vitaly, almost 12 years ago

Just like I commented on the original incident report post, I think systems like Redis are not suitable to work as a db for payment processing and transaction storage. Reading through the report I can't imagine something like this happening with a payment system built around Postgres. Not unless you are doing something incredibly stupid. And stupid those guys are not.

They are obviously bright guys meaning well, and yet they've designed and implemented a payment system with such a bad failure mode.

I do understand that they have a LOT of billing events, and have to update customer billable amounts for each of them. But instead of holding the customer balances in Redis and doing payment processing on top of that, my paranoia would most probably lead me to only store 'amount to charge' in Redis and update it as frequently as needed, and store customer balances and transactions in an RDBMS, changing those only during the actual charge event. This way, if Redis data were to be lost, I'd under-charge my customers and not over-, double- or triple-charge them. The failure mode becomes less disastrous.
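A toy model of the split proposed in this comment, as a sketch only: two plain dicts stand in for the RDBMS table and the Redis counters, and the `Billing` class and its method names are made up for illustration. It shows why losing the volatile side can only under-charge.

```python
class Billing:
    """Toy model: durable balances, volatile 'amount to charge' counters."""

    def __init__(self, balances):
        self.balances = dict(balances)  # stand-in for the durable RDBMS table
        self.pending = {}               # stand-in for the volatile Redis counters

    def record_usage(self, customer, cents):
        # Hot path: only the volatile counter is touched on each billing event.
        self.pending[customer] = self.pending.get(customer, 0) + cents

    def settle(self, customer):
        # Charge event: the only place the durable balance changes.
        owed = self.pending.pop(customer, 0)
        self.balances[customer] -= owed
        return owed

    def lose_volatile_data(self):
        # Simulates the volatile store coming back empty after a failure.
        self.pending.clear()
```

If `lose_volatile_data()` runs between `record_usage()` and `settle()`, the customer simply is not billed for that usage; the durable balance is never moved by stale or missing counters.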

MichaelGG, almost 12 years ago

I do not understand why, when updating a balance from a CC transaction, you wouldn't be using transactions.

    Start Transaction
    Update Balances
    Call CC Processor
    Commit

That would eliminate "the billing system charged customer credit cards to increase account balances without being able to update the balances themselves" -- you don't go call a non-transactional CC processor until you've actually been able to process the update in your own system (which you can easily rollback).

If you're worried about Commits failing (due to not using pessimistic locking, for instance), then separate it into two transactions. That way when you go to process the CC the next time, you have a record stating there's already a transaction in-flight.

For financial records, I'd expect a bit more care. Sounds like they had proper records, but only as a backup/logging.

(Even for telecom, in which I work. There are fully ACID databases that have no problems handling millions of transactions/sec. In-flight balance information is trivial to handle.)

Twilio incident and Redis
