
Dynamo Systems Work Too Hard

78 points by jamesmiller5 about 12 years ago

8 comments

jbellis about 12 years ago

This misses the point.

There are two main reasons why, when I was researching scalable databases, I primarily gravitated towards Dynamo-style replication (Cassandra, Voldemort, and at the time, Dynomite):

- There is no such thing as failover. Dynamo replication takes node failure in stride. This is what you want for a robust system where "Network Partitions are Rare, Server Failures are Not." Not only does it prevent temporary unavailability during the failover, it rules out an entire class of difficult, edge-case bugs. (Which every master-election-and-failover system out there has been plagued with.)

- It generalizes to multiple datacenters as easily as to multiple machines, allowing local latencies for reads AND writes, in contrast to master-based systems where you always have to hit the master (possibly cross-DC) for at least writes. (Couchbase is unusual in that it apparently forces read-from-master as well.) Cassandra has pushed this the farthest, allowing you to choose synchronous replication to local replicas and asynchronous to remote ones, for instance: http://www.datastax.com/docs/1.2/dml/data_consistency

/Cassandra project chair
Comment #5652895 not loaded
pbailis about 12 years ago

There's at least one good reason for Dynamo's write-to-all and read-from-all mechanism: latency.

What you've called 'W=2' in Couchbase is "write to master and at least one slave." Dynamo-style 'W=2' means "write to any two replicas." This can decrease tail latencies since you don't have to wait for the master--any two will do; similarly for 'R=2'. Indeed, Dynamo 'W=2, R=2' will incur more read load than master-based reads (at least double, but not necessarily triple, in your figures). So I think it's more accurately a trade-off between latency and server load.

There can be big benefits to this redundant work. For example: http://www.bailis.org/blog/doing-redundant-work-to-speed-up-distributed-queries/

But don't take it from me: http://cacm.acm.org/magazines/2013/2/160173-the-tail-at-scale/fulltext

Anyway, I'm pretty sure CASSANDRA-4705 (https://issues.apache.org/jira/browse/CASSANDRA-4705), which allows for Dean-style redundant requests, both decreases the read load (at least from the factor of N in your post) *and* should still reduce tail latency without compromising on semantics.

I don't have skin in this game, but I'm pretty sure that the Dynamo engineers had a good idea of what they were doing. (That said, the regular [non-linearizable] semantics for R+W>N are sort of annoying compared to a master-slave system, but can be fixed with write-backs.)
Comment #5653710 not loaded
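The latency trade-off pbailis describes can be sketched numerically. The snippet below is an illustrative model only (the latency distribution and the assumption that writes go out in parallel are invented for the example, not taken from Couchbase or Dynamo): a master-based 'W=2' write must wait for the master plus its fastest slave, while a Dynamo-style 'W=2' write completes once any two of the N replicas acknowledge, i.e. it waits only for the second-fastest replica.

```python
import random

# Toy model of write latency under two 'W=2' schemes.
# N replicas; each replica is usually fast (1 ms) with an
# occasional slow outlier (20 ms) to model tail latency.
N, W = 3, 2
random.seed(1)

def replica_latencies():
    return [random.choice([1, 1, 1, 20]) for _ in range(N)]

def master_write(lat):
    # Master-based W=2: must include replica 0 (the master),
    # plus the fastest of the slaves.
    return max(lat[0], min(lat[1:]))

def dynamo_write(lat):
    # Dynamo-style W=2: any two acks will do, so the client
    # waits only for the W-th fastest replica.
    return sorted(lat)[W - 1]

trials = [replica_latencies() for _ in range(10_000)]
print("master-based avg:", sum(master_write(l) for l in trials) / len(trials))
print("dynamo-style avg:", sum(dynamo_write(l) for l in trials) / len(trials))
```

Because the Dynamo-style write waits for the best two replicas rather than one fixed replica plus one, its per-request latency is never worse in this model, which is where the tail-latency win comes from; the cost is the extra load on all N replicas.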
gigq about 12 years ago

This also exactly describes how HBase works. I've always preferred HBase to Cassandra for this exact reason. You put far less read load on your servers, and you don't have to worry about most of the things on http://wiki.apache.org/cassandra/Operations.

Another benefit that is not mentioned is that with a master-based system you can easily move who is responsible for the data if a server starts to hotspot. In Cassandra you have to use random key distribution, because if a server does hotspot the only solution is to split the token ring, an intensive operation that is hard to do while the server is under heavy load.
Comment #5654194 not loaded
Comment #5653885 not loaded
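gigq's point about splitting the token ring rests on consistent hashing: each node owns the arc of the hash ring up to its token, so relieving a hotspot means adding or moving tokens, which relocates data while the node is already overloaded. A minimal, hypothetical sketch of that key-to-node mapping (node names and the md5-based hash are illustrative; this is not Cassandra's actual partitioner code):

```python
import bisect
import hashlib

RING = 2**32  # size of the hash ring

def token(s):
    # Hash a string onto the ring (md5 chosen only for the example).
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % RING

class TokenRing:
    def __init__(self, nodes):
        # Each node's token is the hash of its name; keep them sorted
        # so ownership lookups are a binary search.
        self.tokens = sorted((token(n), n) for n in nodes)

    def owner(self, key):
        # A key belongs to the first node whose token follows the
        # key's hash, wrapping around the ring.
        ts = [t for t, _ in self.tokens]
        i = bisect.bisect(ts, token(key)) % len(self.tokens)
        return self.tokens[i][1]

ring = TokenRing(["nodeA", "nodeB", "nodeC"])
print(ring.owner("user:42"))
```

Because ownership is fixed by the hash, a hot key range can only be offloaded by inserting a new token inside the hot node's arc, and every key between the old and new tokens must then move; a master-based system can instead just reassign the range.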
gukjoon about 12 years ago

Great article, Damien. This idea that network partitions are exceedingly rare is the reason ElasticSearch goes CA rather than the AP that many other NoSQL datastores choose.

http://elasticsearch-users.115913.n3.nabble.com/CAP-theorem-tp891925p894234.html

Not only are network partitions rare, the most disastrous case, where the cluster splits in half, is even rarer. Usually you have a small part of the cluster partition away.

I hope people don't take this as a Dynamo vs. Couch discussion, because the relative importance of partition tolerance is a topic that spans all datastores that give up on ACID.
Comment #5654878 not loaded
crb about 12 years ago
When your units of networking concern are "availability zones" (i.e. data centers) rather than just switches, wouldn't network failures now be more common than server failures?
Comment #5654416 not loaded
jeremiahjordan about 12 years ago

Even if switch failures are rarer, Couch at W=1 will silently drop data during a network partition and Dynamo at W=2 won't, so how is the comparison at the end valid?
Comment #5652749 not loaded
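jeremiahjordan's durability point can be made concrete with a toy model (the semantics below are assumed for illustration, not drawn from Couchbase or Dynamo source): an ack at W=1 means exactly one replica holds the write, so losing that node before it replicates loses acknowledged data, while at W=2 any single node failure leaves at least one surviving copy.

```python
# N = 3 replicas, numbered 0..2; replica 0 plays the master.
N = 3

def surviving_copies(acked_replicas, failed_node):
    # Which replicas still hold an acknowledged write after one
    # node fails, assuming replication had not yet caught up.
    return [r for r in acked_replicas if r != failed_node]

# W=1: only the master (replica 0) has the write at ack time.
# If the master then fails, the acknowledged write is gone.
assert surviving_copies({0}, failed_node=0) == []

# W=2: two replicas acked before the client saw success, so any
# single node failure still leaves at least one copy.
for failed in range(N):
    assert len(surviving_copies({0, 1}, failed)) >= 1

print("W=2 survives any single node failure")
```

This is the asymmetry behind the comment: the switch-MTBF-vs-node-MTBF comparison at the end of the article only holds if both systems give the same durability guarantee at ack time, and at W=1 they don't.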
pkolaczk about 12 years ago

Why compare the MTBF of a single network switch to the MTBF of a node? Why not compare the MTBF of a single network switch to the MTBF of a single CPU or motherboard? Unless you're talking about a hobby-size network, there is usually much more between the nodes than a single network switch.
leef about 12 years ago
Despite the name you can't actually assume DynamoDB is based on the Dynamo paper architecture.
Comment #5654408 not loaded
Comment #5654102 not loaded