A Distributed Real-Time Data Store with Flexible Deduplication

46 points by prospero, over 8 years ago

3 comments

alexatkeplar, over 8 years ago
There are other reasons for duplicates in event streams, not just the dupes introduced by at-least-once processing in Kinesis or Kafka workers. We've done a lot of thinking about this (all open-source) at Snowplow; this is a good starting point:

http://snowplowanalytics.com/blog/2015/08/19/dealing-with-duplicate-event-ids/

Our last release started to tackle dupes caused by bots, spiders and dodgy UUID algos:

http://snowplowanalytics.com/blog/2016/12/20/snowplow-r86-petra-released/#synthetic-dedupe
csears, over 8 years ago
I would be curious to know if they evaluated any cloud-based data stores or streaming services from AWS or GCP before deciding to build this from scratch. It seems like a common set of requirements for event analytics pipelines.
julienmarie, over 8 years ago
It reminds me exactly of the common architecture pattern of KDB/Q. Still, at this point, it's a marvel of tech.