TechEcho

Usenix ATC best student paper award on distributed storage

24 points, by sqrtnlogn, almost 12 years ago

2 comments

tomp, almost 12 years ago
TL;DR: Datacenter operators incur a significant cost if, after a cluster-wide power outage, some nodes fail permanently; finding the chunks of data that are lost (i.e. all replicas failed) is a large fixed cost, so it is in their interest to reduce the probability of data loss at the expense of increasing the magnitude of data loss (i.e. you lose data less often, but when you do, you lose more of it).

> The probability of data loss is minimized when each node is a member of exactly one copyset. For example, assume our system has 9 nodes with R[eplication] = 3 that are split into three copysets: {1, 2, 3}, {4, 5, 6}, [and] {7, 8, 9}. Our system would only lose data if nodes 1, 2 and 3, nodes 4, 5 and 6, or nodes 7, 8 and 9 fail simultaneously.

> In contrast, with random replication and a sufficient number of chunks, any combination of 3 nodes would be a copyset, and any combination of 3 nodes that fail simultaneously would cause data loss.

In the scheme above, when a single node fails there are only 2 other nodes from which a new replacement node can bootstrap. The authors therefore relax the constraint that each node belongs to exactly one copyset, which slightly increases the probability of data loss but speeds up recovery from partial failures.
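The 9-node example in the quote can be checked by brute force. This sketch counts, for each scheme, how many of the C(9, 3) = 84 possible simultaneous 3-node failures lose data; the copyset list is the one from the comment, and the random-replication count assumes enough chunks that every 3-node combination holds all replicas of some chunk:

```python
from itertools import combinations

N, R = 9, 3
nodes = range(1, N + 1)

# Copyset scheme from the comment: each node is in exactly one copyset.
copysets = [{1, 2, 3}, {4, 5, 6}, {7, 8, 9}]

# A simultaneous failure of R nodes loses data iff those nodes form a copyset.
failing_triples = list(combinations(nodes, R))
fatal_copyset = sum(1 for t in failing_triples if set(t) in copysets)

# Random replication with many chunks: every R-node combination is a copyset,
# so every simultaneous triple failure loses (some) data.
fatal_random = len(failing_triples)

print(fatal_copyset, fatal_random)  # 3 84
```

So the copyset scheme loses data in only 3 of 84 failure patterns instead of all 84, at the cost that each of those 3 patterns destroys a third of the cluster's chunks rather than a few.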
eigenrick, almost 12 years ago
It seems that this is just a structured way to formally keep more copies of your data, when what you're trying to avoid is a rack-level event removing availability of your replicas.

Ceph, described here: http://ceph.com/papers/weil-thesis.pdf, does just that by letting you include the structure of your datacenter in the pseudorandom, deterministic placement algorithm it uses for placing reads and writes.
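The placement idea the comment attributes to Ceph (CRUSH) can be sketched in miniature. This is not Ceph's actual algorithm, just a hash-based toy with a hypothetical rack map, showing the two properties at issue: placement is deterministic (any client recomputes it with no directory lookup), and replicas land in distinct racks so a rack-level event cannot remove all of them:

```python
import hashlib

# Hypothetical datacenter layout: rack name -> nodes in that rack.
racks = {
    "rack-a": ["a1", "a2", "a3"],
    "rack-b": ["b1", "b2", "b3"],
    "rack-c": ["c1", "c2", "c3"],
}

def h(*parts):
    """Deterministic pseudorandom weight derived from the object id and a salt."""
    key = "/".join(parts).encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:8], "big")

def place(obj_id, replicas=3):
    """Pick `replicas` racks by hash order, then one node within each rack,
    so no single rack holds more than one replica of the object."""
    chosen_racks = sorted(racks, key=lambda r: h(obj_id, r))[:replicas]
    return [max(racks[r], key=lambda n: h(obj_id, n)) for r in chosen_racks]

print(place("chunk-0042"))  # same answer on every client, one node per rack
```

The real CRUSH algorithm generalizes this to arbitrary hierarchies (rows, racks, hosts), weighted devices, and stable remapping when devices are added or removed, but the deterministic, topology-aware selection is the same basic trick.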