Erasure Coding versus Tail Latency

75 points · by usrme · about 1 year ago

8 comments

ot · about 1 year ago
It is worth noting that this does not come for free, and it would have been nice for the article to mention the trade-off: reconstruction is not cheap on CPU if you use something like Reed-Solomon.

Usually the codes used for erasure coding are in *systematic* form: there are k "preferential" parts out of M that are just literal fragments of the original blob, so if you get those you can just concatenate them to get the original data. If you get any other k-subset, you need to perform expensive reconstruction.
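
For concreteness, a toy sketch of that fast path versus reconstruction, using a trivial 2-of-3 code with a single XOR parity in place of real Reed-Solomon (the names and the padding scheme here are illustrative, not from the article):

```python
# Toy systematic (k=2 of M=3) code: two literal data fragments plus one
# XOR parity fragment. Real deployments use Reed-Solomon, where the
# non-systematic decode below is a matrix solve, not a cheap XOR.

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def encode(blob: bytes) -> list[bytes]:
    half = -(-len(blob) // 2)                    # ceil division
    d0, d1 = blob[:half], blob[half:].ljust(half, b"\0")
    # Fragments 0 and 1 are the "preferential" systematic parts.
    return [d0, d1, xor_bytes(d0, d1)]

def decode(frags: dict[int, bytes], blob_len: int) -> bytes:
    if 0 in frags and 1 in frags:
        # Fast path: got the systematic fragments, just concatenate.
        return (frags[0] + frags[1])[:blob_len]
    # Slow path: any other 2-subset needs reconstruction (one XOR here;
    # with Reed-Solomon this is the expensive case).
    if 0 in frags:
        return (frags[0] + xor_bytes(frags[0], frags[2]))[:blob_len]
    return (xor_bytes(frags[1], frags[2]) + frags[1])[:blob_len]
```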

dmw_ng · about 1 year ago
Nice to see this idea written about in detail. I had thought about it in the context of terrible-availability bargain-bucket storage (iDrive E2), where the cost of (3,2) erasure coding an object and distributing each segment to one of 3 regions would still be dramatically lower than paying for more expensive, more reliable storage.

Say one chunk lives in each of Germany, Ireland, and the US. The client races GETs to all 3 regions and cancels the request to the slowest to respond (which may also be down). Final client latency is equivalent to that of the 2nd-slowest region, with substantially better availability due to the ability to tolerate any single region being down.

Still wouldn't recommend using E2 for anything important, but ^ was one potential approach to dealing with its terribleness. It still doesn't address the reality that when E2 regions go down, it is often for days and reportedly sometimes weeks at a time. So reliable writing in this scenario would necessitate some kind of queue with capacity for weeks of storage.

There are variants of this scheme where you could potentially balance the horrible-reliability storage with some expensive reliable storage as part of the same system, but I never got that far in thinking about how it would work.
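
A minimal asyncio sketch of that racing-read pattern, with the per-region fetch simulated by a random delay (the region list, key scheme, and latencies are made up for illustration):

```python
import asyncio
import random

REGIONS = ["de", "ie", "us"]  # one fragment of a (3,2)-coded object per region
K = 2                         # any 2 of the 3 fragments reconstruct the object

async def fetch_fragment(region: str, key: str) -> bytes:
    # Hypothetical stand-in for a GET against one region's endpoint,
    # simulated here with a random delay.
    await asyncio.sleep(random.uniform(0.01, 0.5))
    return f"{region}:{key}".encode()

async def racing_get(key: str) -> list[bytes]:
    # Race GETs to all three regions, keep the first K responses, and
    # cancel the straggler (which may also simply be down).
    pending = {asyncio.create_task(fetch_fragment(r, key)) for r in REGIONS}
    got: list[bytes] = []
    while len(got) < K:
        done, pending = await asyncio.wait(
            pending, return_when=asyncio.FIRST_COMPLETED)
        got.extend(t.result() for t in done)
    for t in pending:
        t.cancel()
    return got[:K]  # hand these fragments to the erasure decoder

# asyncio.run(racing_get("some-object"))  # latency ≈ 2nd-fastest region
```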

sujayakar · about 1 year ago
This is a really cool idea.

One follow-up I was thinking of is whether this can generalize to queries other than key-value point lookups. If I'm understanding correctly, the article suggests taking a key-value store and, for every `(key, value)` in the system, splitting `value` into fragments that are stored on different shards with some `k` of `M` code. Then at query time, we can split a query for `key` into `k` subqueries that we send to the relevant shards, and reassemble the query results into `value`.

So, if we were to do the same business for an ordered map with range queries, we'd need to find a way to turn a query for `interval: [start, end]` into some number of subqueries that we could send to the different shards and reassemble into the final result. Any ideas?
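
A toy sketch of that point-lookup scheme, using in-memory dicts as stand-in shards and 4 data fragments plus one XOR parity as the `k`-of-`M` code (a real deployment would use Reed-Solomon and fan the subqueries out over the network):

```python
def xor_all(parts: list[bytes]) -> bytes:
    out = parts[0]
    for p in parts[1:]:
        out = bytes(x ^ y for x, y in zip(out, p))
    return out

M, K = 5, 4                          # 4 data fragments + 1 XOR parity = 4-of-5
shards = [dict() for _ in range(M)]  # toy stand-ins for remote shards

def put(key: str, value: bytes) -> None:
    n = -(-len(value) // K)          # fragment size, ceil division
    frags = [value[i*n:(i+1)*n].ljust(n, b"\0") for i in range(K)]
    for shard, frag in zip(shards, frags + [xor_all(frags)]):
        shard[key] = (len(value), frag)

def get(key: str, respondents: list[int]) -> bytes:
    # `respondents` = the K shards that answered first (out of M queried).
    got = {i: shards[i][key][1] for i in respondents}
    size = shards[respondents[0]][key][0]
    missing = set(range(K)) - set(got)
    if missing:  # one data fragment absent: rebuild it from the parity
        (m,) = missing
        got[m] = xor_all(list(got.values()))
    return b"".join(got[i] for i in range(K))[:size]

put("k", b"hello erasure world")
assert get("k", [0, 1, 2, 3]) == b"hello erasure world"  # all-data fast path
assert get("k", [0, 1, 3, 4]) == b"hello erasure world"  # parity reconstruction
```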

loeg · about 1 year ago
Yeah. And you get the storage for free if your distributed design also uses the erasure-encoded chunks for durability. Facebook's Warm Storage infrastructure does something very similar to what this article describes.

benlivengood · about 1 year ago
The next level of efficiency is using nested erasure codes. The outer code can be across regions/zones/machines/disks while the inner code is across chunks of a stripe. Chunk unavailability is fast to correct with an extra outer chunk, and bit rot or corruption can be fixed by the inner code without an extra fetch. In the fast path, only data chunks need to be fetched.
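
A toy sketch of what the inner layer could look like, assuming a single XOR parity sub-block and per-sub-block SHA-256 checksums (the outer layer would be an ordinary k-of-M code across machines, as in the sketches above):

```python
import hashlib

S = 4  # data sub-blocks per chunk; one extra XOR parity sub-block

def _xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def inner_encode(chunk: bytes) -> list[tuple[bytes, bytes]]:
    n = -(-len(chunk) // S)
    subs = [chunk[i*n:(i+1)*n].ljust(n, b"\0") for i in range(S)]
    parity = subs[0]
    for s in subs[1:]:
        parity = _xor(parity, s)
    # Each sub-block carries a checksum so bit rot is detectable in place.
    return [(s, hashlib.sha256(s).digest()) for s in subs + [parity]]

def inner_decode(coded: list[tuple[bytes, bytes]], size: int) -> bytes:
    subs = [s if hashlib.sha256(s).digest() == h else None for s, h in coded]
    bad = [i for i, s in enumerate(subs) if s is None]
    if bad:  # repair a single rotted sub-block locally, with no extra fetch
        (i,) = bad
        others = [s for s in subs if s is not None]
        fixed = others[0]
        for s in others[1:]:
            fixed = _xor(fixed, s)
        subs[i] = fixed
    return b"".join(subs[:S])[:size]

payload = b"some stripe chunk payload"
coded = inner_encode(payload)
s, h = coded[2]
coded[2] = (bytes([s[0] ^ 0xFF]) + s[1:], h)  # simulate bit rot in one sub-block
assert inner_decode(coded, len(payload)) == payload
```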

siscia · about 1 year ago
Nice to see this talked about here, and Marc being public about it.

AWS is such a big place that even after a bit of tenure you still have places to look to find interesting technical approaches, and when I was introduced to this scheme for Lambda storage I was surprised.

As Marc mentions, it is such a simple and powerful idea, and it is definitely not mentioned enough.

ghusbands · about 1 year ago
The first graph is incredibly misleading. The text talks about fetching from 5 servers and needing 4 results vs fetching from 1 server and needing 1 result. Then the graph compares 4-of-5 to 4-of-4 latency, which is just meaningless. It should compare 4-of-5 with 1-of-1.

jeffbee · about 1 year ago
I do not follow. How is it possible that the latency is lower in a 4-of-5 read of a coded stripe compared to a 1-of-4 replicated stripe?
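
One way to ground the question is to simulate the order statistics involved. A minimal sketch, assuming lognormal per-server latencies (an arbitrary heavy-tailed choice, not the article's data), comparing waiting for the 1st of 1, 1st of 4, 4th of 4, and 4th of 5 responses:

```python
import random

def read_latency(n: int, need: int) -> float:
    # Latency of a read that waits for `need` of `n` parallel responses:
    # the `need`-th order statistic of the per-server latencies.
    return sorted(random.lognormvariate(0, 1) for _ in range(n))[need - 1]

N = 100_000
for label, n, need in [("1-of-1", 1, 1), ("1-of-4", 4, 1),
                       ("4-of-4", 4, 4), ("4-of-5", 5, 4)]:
    lat = sorted(read_latency(n, need) for _ in range(N))
    print(f"{label}: p50={lat[N // 2]:.2f}  p99={lat[int(N * 0.99)]:.2f}")
```

Under this toy model, the 4-of-5 tail beats 4-of-4 badly (the spare fragment absorbs one straggler) and hedged 1-of-4 reads beat everything; which pairing is the fair comparison depends on how many bytes each request has to move.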