Dynamo: A flawed architecture

49 points, by ypavan, over 15 years ago

11 comments

werner, over 15 years ago
Dynamo is still in use, but as with all technologies that have to operate at Amazon's scale, the systems evolve rapidly. The storage systems in use now no longer look like the ones from 5 years ago when Dynamo was developed.

The Dynamo SOSP paper had two goals: 1) show how systems are a composition of techniques and how all of these need to work together to build a production system, and 2) given that it was based on a variety of research results, it was intended to give feedback to the academic community about the difficulties of moving from research results to production, and about what matters in real life, i.e. in production.

The paper was never intended to be a complete blueprint for easy design of follow-up systems. There just isn't enough room in an academic paper to do justice to that. If it was going to play any role in that, the best we could hope for was as a collection of points you would have to think hard about and make decisions on when you were going to design your own storage engine.

That doesn't mean I agree with the conclusions of the analysis; on the contrary, I think they are seriously flawed as well (as with everything else that is not absolutely perfect :-)). But I can understand that when people expect the Dynamo paper to be a blueprint that solves all their storage needs and provides a perfectly available service under all failure scenarios (and solves world peace), they may be left with a few questions afterwards.

I always thought that the real contribution of the paper was that it made you think hard about the trade-offs you face when you have to design highly available, ultra-scalable systems that are cost-effective and provide guaranteed performance. 5 years later we know a lot more, but this stuff is still hard, and we still need to balance rigorous principles with production magic to make it work. But you do need to fully understand the principles before you can make the production trade-offs.

Caveat: some of these remarks were tongue-in-cheek fun; I leave it to the reader which ones :-)
jbellis, over 15 years ago
Basically, the author doesn't understand quorum protocols and draws a bunch of erroneous conclusions because of that.

Here's the core misunderstanding:

"It is hinted that by setting the number of reads (R) and number of writes (W) to be more than the total number of replicas (N) (i.e. R+W > N) - one gets consistent data back on reads. This is flat out misleading. On close analysis one observes that there are no barriers to joining a quorum group (for a set of keys). Nodes may fail, miss out on many many updates and then rejoin the cluster - but are admitted back to the quorum group without any resynchronization barrier."

That's because adding a barrier (a) isn't necessary for the consistency guarantee, (b) doesn't add any extra safety in the worst case, and (c) adds complexity (always be suspicious of complexity!)

Consider the simple case of a 3-node cluster (A, B, C) with N=3.

For quorum reads and writes, R = W = 2.

Then reads will be consistent if any two nodes are up. It doesn't matter if nodes A and B are up for the write, then B and C are up for the read -- the reader will still always see the latest write.

Of course you will have to block writes and reads if more than one node is down for either operation. This is what the Dynamo paper means when it talks about allowing clients to decide how much availability to trade for consistency.

The rest of the article is basically variations on this theme. (The only way to provide strong consistency is to "read from all the replicas all the time"!? Wrong, wrong, wrong.)
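To make the overlap argument concrete, here is a minimal sketch (my own illustration, not Dynamo's actual protocol; the node names, version counters, and failure scenario are assumptions) of a 3-replica quorum with R = W = 2, where every read quorum must share at least one node with every write quorum:

    # Sketch only: three replicas, N=3, write quorum W=2, read quorum R=2.
    # Because R + W > N, a read quorum always overlaps the last write quorum
    # in at least one node, so the reader always sees the latest write.

    replicas = {node: (0, None) for node in "ABC"}   # node -> (version, value)
    N, W, R = 3, 2, 2

    def quorum_write(value, version, up_nodes):
        acked = list(up_nodes)[:W]
        if len(acked) < W:
            raise RuntimeError("fewer than W replicas reachable; write must block")
        for node in acked:
            replicas[node] = (version, value)

    def quorum_read(up_nodes):
        polled = [replicas[node] for node in list(up_nodes)[:R]]
        if len(polled) < R:
            raise RuntimeError("fewer than R replicas reachable; read must block")
        return max(polled, key=lambda pair: pair[0])[1]   # highest version wins

    # Write while C is down, then read while A is down:
    quorum_write("latest value", version=1, up_nodes=["A", "B"])
    print(quorum_read(["B", "C"]))   # -> "latest value" (quorums overlap at B)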
mbrubeck, over 15 years ago
[Disclosure: I used to work at Amazon, but not on Dynamo or Cart, and have no non-public information about those systems.]

I think this is based on a flawed application of metrics from other organizations that do not match Amazon's actual business needs.

The author talks about bank transaction systems with five nines of availability. Sure, you can build a centralized system with a critical core that has near-zero downtime - usually at great operational cost - but there's not a bank in the world that's actually 99.999% available *to customers, over the internet*. Most retail banks I've used take several hours of outages *per month* for maintenance on their web services.

Amazon wants its systems not just to keep running but to keep taking orders from millions of people all over the world. This means that they are concerned with the reliability not just of a storage server, but everything needed to connect it to its application servers and to customers. They have found that the cost-effective way to do this is to distribute every component across geographically separated datacenters. Amazon has remained available through real and simulated datacenter-wide outages ranging from power/cooling failures to floods, fire, and hurricanes. No Amazon system lives within a single building, much less a single network switch or rack.

Finally, although the formally provable aspects of Dynamo's "eventual consistency" guarantee may be vague, any team operating a distributed system will study and understand the actual operational characteristics in practice, under both normal conditions and various failure modes. Some systems may have realistic failures that lead to days or hours of inconsistency (in which case the team has deliberately chosen this and will write client software to be aware of it); others might be tuned to achieve consistency within milliseconds under normal operation and to set off alarms within seconds after a failure. I've never known any that would be used in circumstances where "human lifetimes" are a relevant time period.
HenryR, over 15 years ago
I don't agree with all his conclusions. In particular, it seems as though he is saying 'eventual consistency is no practical good' and then beating Dynamo for a few paragraphs with the same stick.

The most often quoted example of Dynamo's use is the shopping cart application on every Amazon page. In the worst case, your shopping cart will mysteriously empty itself. This is a huge pain, and a potential loss for Amazon, but it's not catastrophic in the way that is implied here. Indeed, assuming the liveness of a quorum, the application will read back all conflicting entries for the shopping cart (those that aren't ordered under their vector clock timestamps) and the onus is on it to merge the conflicts. Of course, the shopping cart will take the union of all updates to ensure that nothing is dropped (and therefore some delete operations may be lost).

The key point is that some applications can do without observing a linearisable history, and the interest of this paper is that it explores the design space if you drop that requirement.

I don't understand the post's points about CAP; all three requirements are in tension. Dynamo is unusual in that it is live in the case of a network partition while still maintaining its consistency guarantees.

Similarly - those systems that use chain-replication asynchronously like he describes can still suffer from the same read-old-value-after-it-was-written consistency issue, if the reader jumps between two replicas for consecutive reads. Avoiding that can require synchronous coordination of updates (a la Paxos, e.g.) which is, I think, what the paper is driving at. Otherwise, there are still failure modes which, in order to patch up, require stronger guarantees about liveness of quorums than Dynamo needs.

I understand that Dynamo is no longer used internally at Amazon at scale, so maybe some of the practical points this post makes about the realities of central coordination held water for real deployments. Still, I don't buy the reaction that prioritising availability uber alles and designing a system that does not behave exactly like a strongly-consistent key-value store immediately invalidates it for workloads that have high availability requirements and lower consistency needs.
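For readers unfamiliar with the union-merge behaviour described above, here is a rough sketch (my own illustration, not Amazon's code; the cart contents and replica states are invented) of why added items always survive the merge but deletes can be resurrected:

    # Sketch only: merging divergent shopping-cart replicas by set union.
    # Every item ever added survives the merge, but an item deleted on one
    # replica reappears if another replica never saw the delete.

    def merge_carts(conflicting_versions):
        merged = set()
        for cart in conflicting_versions:
            merged |= cart
        return merged

    replica_1 = {"phone charger"}                    # user deleted the book here
    replica_2 = {"phone charger", "paperback book"}  # stale replica missed the delete

    print(merge_carts([replica_1, replica_2]))
    # {'phone charger', 'paperback book'} -- the deleted book comes back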
ynniv, over 15 years ago
The author argues, somewhat persuasively, that Dynamo sacrifices more consistency than it gains in availability. Part of his argument concerns the difficulty of truly leveraging partitioning to gain availability - mostly the practical issue of users finding the currently available nodes in such a system. Another concerns the problems that arise when nodes rejoin a cluster without first resynchronizing. He seems to favor the BigTable family, or simple master-slave replication (assuming that the write load is low enough for this to be acceptable).
jsensarma, over 15 years ago
@everyone - thanks for all the comments. I have incorporated the multiple comments pointing out that the data-loss scenario is incorrect when vector clocks are used. I have been thinking too much about Cassandra lately - and it doesn't use vector clocks. That said - I continue to believe that returning stale reads is bad and best avoided, and that unbounded staleness is not acceptable for many applications. To the extent that this is an avoidable scenario in a tightly coupled environment within a single data center - I consider it to be a significant drawback.

@jbellis - I hope the responses to your comment have convinced you about the problems with the Dynamo quorum scheme/read-write protocols. I can confirm that the problems I described definitely do exist in Cassandra. Jun Rao made this point on the Cassandra public mailing lists a long time back, and I have a pointer to the JIRA that he filed in my post as well.

Regarding Dynamo being an interesting design space for the academic community etc.: that may very well have been the case - but the reality of the situation is that the world is now overflowing with Dynamo clones, with people considering them for all kinds of usages. Hey - if it was good for Amazon (and Facebook and LinkedIn) - it's probably good for me! The people trying to use Dynamo clones do not understand all the small details. They don't understand which applications would be safe to write on top of them and which would not. I hope my posts (imperfect and opinionated, undoubtedly) provide a counterpoint to this sentiment and make users think harder and deeper before they make the leap.

I was also hoping to trigger a discussion (looks like I succeeded). I hope that it takes us to a better design space than what exists currently.
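As context for the vector-clock correction above, here is a small sketch (my own illustration, not code from Dynamo or Cassandra) of the distinction being discussed: vector clocks can detect that two updates are concurrent and keep both for the application to reconcile, whereas timestamp-based last-write-wins simply discards one of them:

    # Sketch only: vector-clock comparison vs. last-write-wins timestamps.

    def vclock_compare(a, b):
        # Returns 'equal', 'a<=b', 'b<=a', or 'concurrent'.
        keys = set(a) | set(b)
        a_le_b = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
        b_le_a = all(b.get(k, 0) <= a.get(k, 0) for k in keys)
        if a_le_b and b_le_a:
            return "equal"
        if a_le_b:
            return "a<=b"
        if b_le_a:
            return "b<=a"
        return "concurrent"

    # Two clients updated the same key via different replicas:
    v1 = {"replica_A": 2, "replica_B": 1}
    v2 = {"replica_A": 1, "replica_B": 2}
    print(vclock_compare(v1, v2))   # 'concurrent' -> both versions are surfaced

    # Last-write-wins keeps only the update with the larger client timestamp,
    # silently dropping the other (and clock skew can pick the wrong one):
    update_1 = (1000, "value-from-client-1")
    update_2 = (1001, "value-from-client-2")
    print(max(update_1, update_2)[1])   # 'value-from-client-2'; update_1 is gone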
kvs, over 15 years ago
@Werner: Darn, someone figured out that Dynamo is a flawed architecture. Luckily its only use is storing hundreds of millions of shopping carts :-)
codeslinger, over 15 years ago
Well, this guy works at Facebook on Hive. If there's no hope for him, how much hope do the rest of us have ;-)
codeslinger, over 15 years ago
Also, Cassandra is a Dynamo clone? Someone should tell them that...
evgen, over 15 years ago
Abridged version: CAP is hard/scary and eventual consistency is too complicated. Let's go shopping.
tybris, over 15 years ago
If by flawed you mean scalable.