科技回声

8 comments

cle, almost 6 years ago

Lots of architectural and cultural problems IMO. Mixing different kinds of queries on the same cluster, no auto scaling on either the cluster or the web servers, they seem to be okay with customer-impacting maintenance events (seriously?!), and their "fix" for an event caused by cache misses is to add more caching, which will make their system even harder to understand and predict, increasing the likelihood of more severe and byzantine failures.

It's often an unpopular opinion around here, but this is why I prefer simple hosted databases with limited query flexibility for high volume and high availability services (Firestore, DynamoDB, etc.). It's harder to be surprised by expensive queries, and you won't have to fiddle with failovers, auto scaling, caching, etc. Design your system around their constraints and it will have predictable performance and can more easily scale under unexpected load.

viraptor, almost 6 years ago

> We have ensured that failovers for this cluster may only be initiated during rare, scheduled downtime, when there will be no impact on customers.

I hope all their hardware crashes are also scheduled when there will be no impact... This seems a bit backwards - unless you constantly exercise the instantaneous failover, how do you know it works?

Edit: Actually it's worse - if you don't test the instant failover under a full load, how do you know it's still instant then?

twblalock, almost 6 years ago

I suppose this is a good time to ask whether Coinbase thinks they are a bank, or a brokerage, or both.

If they are a bank, this isn't the end of the world. I've had online banking outages at "normal" banks. It is still a bad thing, but there are other ways I can get my money, like going to a branch.

On the other hand, if Coinbase is like a brokerage, this is really bad. And let's face it, most use of crypto is for investment and speculation purposes. For trades to fail for half an hour is really bad. If they are running this thing like a startup on MongoDB (seriously?) I don't see how anyone who puts their money in can have any confidence of getting it back out.

staticassertion, almost 6 years ago

> Before the incident began, a background job performed an aggregation for a large number of accounts, causing excessive reads into cache and cache evictions for the cluster’s storage engine cache.

I thought this was interesting. I think that caches can be so dangerous in an incident - suddenly operations that are almost always constant time are executing with a much different complexity, and the worst part is that this tends to happen when you get backups (since old, uncached data is suddenly pushing recent data out).

I think chaos engineering may be a good solution here, in lieu of better architectures - see what happens when you clear your cache every once in a while, how much your load changes, how your systems scale to deal with it.
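
To make the eviction effect concrete, here is a rough Python sketch of a hot working set being pushed out of a small LRU cache by one large scan. It is a toy model only: the LRUCache class, the key names, and the sizes are all made up for illustration, and it assumes plain LRU eviction, which is not necessarily how the storage engine cache in the post mortem behaves.

```python
from collections import OrderedDict


class LRUCache:
    """Toy LRU cache that only tracks hits and misses (hypothetical)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self.data:
            # Cache hit: mark the key as most recently used.
            self.data.move_to_end(key)
            self.hits += 1
        else:
            # Cache miss: pretend to load from disk, then evict the
            # least recently used entry if we are over capacity.
            self.misses += 1
            self.data[key] = True
            if len(self.data) > self.capacity:
                self.data.popitem(last=False)


def hit_rate(cache, keys):
    """Reset the counters, read every key once, return the hit ratio."""
    cache.hits = cache.misses = 0
    for key in keys:
        cache.get(key)
    return cache.hits / (cache.hits + cache.misses)


def simulate():
    cache = LRUCache(capacity=100)
    hot = [f"hot:{i}" for i in range(100)]  # assumed customer-facing working set

    hit_rate(cache, hot)           # warm-up pass, fills the cache
    before = hit_rate(cache, hot)  # steady state

    # Assumed background aggregation: touches many cold keys exactly once.
    for i in range(1000):
        cache.get(f"cold:{i}")

    after = hit_rate(cache, hot)   # same hot reads right after the scan
    return before, after


if __name__ == "__main__":
    before, after = simulate()
    print(f"hit rate before scan: {before:.2f}, after scan: {after:.2f}")
```

Deliberately running something like the scan loop against a staging system is roughly the "clear your cache every once in a while" experiment: you get to see how far the hit rate drops and how long it takes to recover.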

drefanzor, almost 6 years ago

So basically price alerts lagged the system.

kvlr, almost 6 years ago

TLDR: MongoDB

redis_mlc, almost 6 years ago

> We take uptime very seriously, and we’re working hard to support the millions of customers that choose Coinbase to manage their cryptocurrency

No you don't.

- If you did, you'd hire a DBA team and they would be familiar with the various jobs in your environment. But first your founders would have to have respect for Operations, which will take a dozen more major outages.

The other major Coinbase outages have also been database-related, namely missing indexes.

- If you did, you wouldn't be doing major database (or other production) changes at 3 pm.

So let's cut to the chase. You prioritize features over Operations, and as a result guinea-pig your users. Just like any other SF startup. So just admit that to your end-users.

nocitrek, almost 6 years ago

Great post mortem. Well detailed, and it shows a good work ethic.

Coinbase Incident Post Mortem: June 25–26, 2019
