
Data Deduplication at Scale

84 points by i0exception, almost 6 years ago

4 comments

jandrewrogers, almost 6 years ago
Keeping duplication metadata around doesn't scale well, though it may be sufficient for their typical case. A strategy I've seen many times that scales better is to (1) only check for duplicates when some part of the system has a reason to believe duplicates could occur, with the assumption that duplicates are relatively infrequent, and (2) design the storage engine such that you can directly and inexpensively search previously ingested data for the record without any extra deduplication metadata. Note that this is not as easy to achieve if your data infrastructure is a loosely coupled collection of arbitrary storage engines, databases, and processing pipelines -- which may be a practical limitation for the case in the article.

If the storage engine is well-designed for the data model, a duplicate check against existing data should only touch a handful of pages in the worst case; it is an inexpensive query that rarely or never touches the network (depending on the details). For ingestion of records where there is no risk of duplication, presumably the bulk of the time, this is a zero-overhead model, as there is no duplication state to be maintained or checked. For most scenarios that create potential duplication, this model is also quite cache-friendly as a practical matter.

The pathological case for this design is when you need to check every single record for duplication (e.g. cleaning up a giant offline mess of agglomerated data that may contain an arbitrary number of duplicates), but those scenarios usually don't involve real-time stream ingestion.
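The check-on-suspicion strategy described above can be sketched as follows. This is a minimal illustration, not the commenter's actual system: the table schema, the `may_be_duplicate` flag, and the use of SQLite as a stand-in storage engine are all assumptions for the example.

```python
import sqlite3

# Sketch of strategy (1)+(2): rely on the storage engine's own primary-key
# index for duplicate checks, and only perform the check when some part of
# the system suspects a duplicate. No separate dedup metadata is kept.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_id TEXT PRIMARY KEY, payload TEXT)")

def ingest(event_id: str, payload: str, may_be_duplicate: bool = False) -> bool:
    """Insert an event; returns False if it was detected as a duplicate."""
    if may_be_duplicate:
        # Point lookup against the primary-key index: touches only a few
        # B-tree pages in the worst case, no network, no extra state.
        row = conn.execute(
            "SELECT 1 FROM events WHERE event_id = ?", (event_id,)
        ).fetchone()
        if row is not None:
            return False
    conn.execute("INSERT INTO events VALUES (?, ?)", (event_id, payload))
    return True

ingest("e1", "signup")                                   # normal path: no check
stored = ingest("e1", "signup", may_be_duplicate=True)   # retry path: checked
print(stored)  # False
```

On the no-suspicion path the function is a plain insert, which is what makes this a zero-overhead model for the bulk of ingestion.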
rbranson, almost 6 years ago
A counterpoint is our ingest-time deduplication system at Segment: https://segment.com/blog/exactly-once-delivery/

It's done at ingest time because Segment has a completely different use case. Message data fans out over multiple downstream systems, some of which distribute this data to systems outside of our control. However, if I were in Mixpanel's shoes, I'd probably do it how they're describing it here.
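The ingest-time approach can be sketched as a bounded window of recently seen message IDs, checked before fan-out. This is a simplified in-memory illustration of the idea only; per the linked post, Segment's real system persists its dedup window durably rather than in process memory, and the window size here is an arbitrary assumption.

```python
from collections import OrderedDict

class IngestDeduplicator:
    """Drop messages whose ID was already seen within a bounded window,
    so duplicates never reach downstream systems outside our control."""

    def __init__(self, window_size: int = 1_000_000):
        self.window_size = window_size
        self.seen = OrderedDict()  # message_id -> None, in arrival order

    def accept(self, message_id: str) -> bool:
        if message_id in self.seen:
            return False  # duplicate: drop before fan-out
        self.seen[message_id] = None
        if len(self.seen) > self.window_size:
            self.seen.popitem(last=False)  # evict the oldest ID
        return True

dedup = IngestDeduplicator(window_size=3)
print([dedup.accept(m) for m in ["a", "b", "a", "c", "d", "a"]])
# [True, True, False, True, True, True]  ("a" was evicted before it repeated)
```

The bounded window is the key trade-off of ingest-time dedup: duplicates separated by more than the window's span slip through, in exchange for constant memory and O(1) checks.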
ryanworl, almost 6 years ago
"For query-time, we found that reading an extra bit for every event adds around 10ns to the reading of data. This is close to a 2% increase in the query time because of the additional column."

This seems somewhat more expensive than I would've expected. Given your estimates of duplicate probability, the bitset should compress to essentially nothing, so IO is probably not the issue unless you're not compressing it.

Are you doing a virtual function call or something here?
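The compression intuition here is easy to demonstrate: with rare duplicates, a per-event "is duplicate" bit column is almost all zeros and shrinks to a tiny fraction of its raw size under a general-purpose compressor. The 1-in-10,000 duplicate rate below is an illustrative assumption, not a figure from the article.

```python
import zlib

# One bit per event: 8M events -> 1 MB of raw bitset.
n_events = 8_000_000
bits = bytearray(n_events // 8)
for i in range(0, n_events, 10_000):   # mark ~0.01% of events as duplicates
    bits[i // 8] |= 1 << (i % 8)

compressed = zlib.compress(bytes(bits), level=6)
print(len(bits), len(compressed))  # raw vs compressed bytes
```

A mostly-zero column like this compresses by orders of magnitude, which is why the commenter suspects the 2% overhead comes from per-row dispatch cost rather than I/O.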
sethammons, almost 6 years ago
Our deduplication needs to happen at our outgoing edge and is latency-sensitive (have we sent this message to its recipient already?). It needs to handle a couple billion checks a day and be highly available. Interesting problem space. We are redesigning our current solution to be more robust soon.
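The core lookup in an edge-side "have we sent this already?" check can be sketched as a TTL-bounded map keyed by message and recipient. This is a hypothetical single-process illustration only: at billions of daily checks with high availability, the map would live in a sharded, replicated store, but the lookup shape is the same. The key structure and TTL are assumptions for the example.

```python
import time

class SendGuard:
    """Answer "have we sent this message to this recipient already?"
    via an in-memory map from (message_id, recipient) to an expiry time."""

    def __init__(self, ttl_seconds: float = 86_400.0):
        self.ttl = ttl_seconds
        self.sent = {}  # (message_id, recipient) -> expiry timestamp

    def should_send(self, message_id: str, recipient: str) -> bool:
        now = time.monotonic()
        key = (message_id, recipient)
        expiry = self.sent.get(key)
        if expiry is not None and expiry > now:
            return False  # already sent within the TTL window
        self.sent[key] = now + self.ttl
        return True

guard = SendGuard()
print(guard.should_send("msg-1", "alice@example.com"))  # True: first send
print(guard.should_send("msg-1", "alice@example.com"))  # False: duplicate
```

Because the check sits on the latency-sensitive send path, the lookup must be a single key read; anything requiring a scan or cross-shard coordination would not hold up at this volume.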