
Data Deduplication at Scale

84 points by i0exception, almost 6 years ago

4 comments

jandrewrogers, almost 6 years ago
Keeping duplication metadata around doesn't scale well, though it may be sufficient for their typical case. A strategy I've seen many times that scales better is to (1) only check for duplicates when some part of the system has a reason to believe duplicates could occur, with the assumption that duplicates are relatively infrequent, and (2) design the storage engine such that you can directly and inexpensively search previously ingested data for the record without any extra deduplication metadata. Note that this is not as easy to achieve if your data infrastructure is a loosely coupled collection of arbitrary storage engines, databases, and processing pipelines -- which may be a practical limitation for the case in the article.

If the storage engine is well-designed for the data model, a duplicate check against existing data should only touch a handful of pages in the worst case; it is an inexpensive query that rarely or never touches the network (depending on the details). For ingestion of records where there is no risk of duplication, presumably the bulk of the time, this is a zero-overhead model, as there is no duplication state to be maintained or checked. For most scenarios that create potential duplication, this model is also quite cache-friendly as a practical matter.

The pathological case for this design is when you need to check every single record for duplication (e.g. cleaning up a giant offline mess of agglomerated data that may contain an arbitrary number of duplicates), but those scenarios usually don't involve real-time stream ingestion.
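The check-on-suspicion strategy described above can be sketched as follows. This is a minimal illustration, not the commenter's actual system: the table schema, the `may_be_duplicate` flag, and the use of SQLite as a stand-in storage engine are all assumptions for the example.

```python
import sqlite3

# Sketch of strategy (1)+(2): rely on the storage engine's own primary-key
# index for duplicate checks, and only perform the check when some part of
# the system suspects a duplicate. No separate dedup metadata is kept.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_id TEXT PRIMARY KEY, payload TEXT)")

def ingest(event_id: str, payload: str, may_be_duplicate: bool = False) -> bool:
    """Insert an event; returns False if it was detected as a duplicate."""
    if may_be_duplicate:
        # Point lookup against the primary-key index: touches only a few
        # B-tree pages in the worst case, no network, no extra state.
        row = conn.execute(
            "SELECT 1 FROM events WHERE event_id = ?", (event_id,)
        ).fetchone()
        if row is not None:
            return False
    conn.execute("INSERT INTO events VALUES (?, ?)", (event_id, payload))
    return True

ingest("e1", "signup")                                   # normal path: no check
stored = ingest("e1", "signup", may_be_duplicate=True)   # retry path: checked
print(stored)  # False
```

On the no-suspicion path the function is a plain insert, which is what makes this a zero-overhead model for the bulk of ingestion.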
rbranson, almost 6 years ago
A counterpoint is our ingest-time deduplication system at Segment: https://segment.com/blog/exactly-once-delivery/

It's done at ingest time because Segment has a completely different use case. Message data fans out over multiple downstream systems, some of which distribute this data to systems outside of our control. However, if I were in Mixpanel's shoes, I'd probably do it how they're describing it here.
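The ingest-time approach can be sketched as a bounded window of recently seen message IDs, checked before fan-out. This is a simplified in-memory illustration of the idea only; per the linked post, Segment's real system persists its dedup window durably rather than in process memory, and the window size here is an arbitrary assumption.

```python
from collections import OrderedDict

class IngestDeduplicator:
    """Drop messages whose ID was already seen within a bounded window,
    so duplicates never reach downstream systems outside our control."""

    def __init__(self, window_size: int = 1_000_000):
        self.window_size = window_size
        self.seen = OrderedDict()  # message_id -> None, in arrival order

    def accept(self, message_id: str) -> bool:
        if message_id in self.seen:
            return False  # duplicate: drop before fan-out
        self.seen[message_id] = None
        if len(self.seen) > self.window_size:
            self.seen.popitem(last=False)  # evict the oldest ID
        return True

dedup = IngestDeduplicator(window_size=3)
print([dedup.accept(m) for m in ["a", "b", "a", "c", "d", "a"]])
# [True, True, False, True, True, True]  ("a" was evicted before it repeated)
```

The bounded window is the key trade-off of ingest-time dedup: duplicates separated by more than the window's span slip through, in exchange for constant memory and O(1) checks.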
ryanworl, almost 6 years ago
"For query-time, we found that reading an extra bit for every event adds around 10ns to the reading of data. This is close to a 2% increase in the query time because of the additional column."

This seems somewhat more expensive than I would've expected. Given your estimates of duplicate probability, the bitset should compress to essentially nothing, so IO is probably not the issue unless you're not compressing it.

Are you doing a virtual function call or something here?
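The compression intuition here is easy to demonstrate: with rare duplicates, a per-event "is duplicate" bit column is almost all zeros and shrinks to a tiny fraction of its raw size under a general-purpose compressor. The 1-in-10,000 duplicate rate below is an illustrative assumption, not a figure from the article.

```python
import zlib

# One bit per event: 8M events -> 1 MB of raw bitset.
n_events = 8_000_000
bits = bytearray(n_events // 8)
for i in range(0, n_events, 10_000):   # mark ~0.01% of events as duplicates
    bits[i // 8] |= 1 << (i % 8)

compressed = zlib.compress(bytes(bits), level=6)
print(len(bits), len(compressed))  # raw vs compressed bytes
```

A mostly-zero column like this compresses by orders of magnitude, which is why the commenter suspects the 2% overhead comes from per-row dispatch cost rather than I/O.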
sethammons, almost 6 years ago
Our deduplication needs to happen at our outgoing edge and is latency-sensitive (have we sent this message to its recipient already?). It needs to handle a couple billion checks a day and be highly available. Interesting problem space. We are redesigning our current solution to be more robust soon.
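The core lookup in an edge-side "have we sent this already?" check can be sketched as a TTL-bounded map keyed by message and recipient. This is a hypothetical single-process illustration only: at billions of daily checks with high availability, the map would live in a sharded, replicated store, but the lookup shape is the same. The key structure and TTL are assumptions for the example.

```python
import time

class SendGuard:
    """Answer "have we sent this message to this recipient already?"
    via an in-memory map from (message_id, recipient) to an expiry time."""

    def __init__(self, ttl_seconds: float = 86_400.0):
        self.ttl = ttl_seconds
        self.sent = {}  # (message_id, recipient) -> expiry timestamp

    def should_send(self, message_id: str, recipient: str) -> bool:
        now = time.monotonic()
        key = (message_id, recipient)
        expiry = self.sent.get(key)
        if expiry is not None and expiry > now:
            return False  # already sent within the TTL window
        self.sent[key] = now + self.ttl
        return True

guard = SendGuard()
print(guard.should_send("msg-1", "alice@example.com"))  # True: first send
print(guard.should_send("msg-1", "alice@example.com"))  # False: duplicate
```

Because the check sits on the latency-sensitive send path, the lookup must be a single key read; anything requiring a scan or cross-shard coordination would not hold up at this volume.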