Solving duplicate data with performant deduplication

45 points by goodroot, over 1 year ago

5 comments

goodroot, over 1 year ago
Hey! Thanks for upvoting.

Happy to answer any questions about deduplication. One thing that's not included in the write-up is that we also address out-of-order indexing alongside deduplication.
goenning, over 1 year ago
If your ClickHouse ReplacingMergeTree returns twice the expected row count, it is because your query is wrong. You don't need FINAL; just use aggregation in your queries, as per their docs.
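To illustrate the aggregation idea in the comment above, here is a minimal sketch. It assumes SQLite (from the Python standard library) as a stand-in for ClickHouse, and a hypothetical `events` table with `id`, `version`, and `value` columns; none of these names come from the thread. In ClickHouse itself the same pattern would more likely be written with `GROUP BY` and an aggregate such as `argMax`, so treat this as a picture of the technique rather than the exact query from their docs.

```python
import sqlite3


def build_demo() -> sqlite3.Connection:
    """Create an in-memory table where every key has been written twice."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema: `version` marks which copy of a row is newest.
    conn.execute("CREATE TABLE events (id INTEGER, version INTEGER, value TEXT)")
    rows = [(1, 1, "a"), (1, 2, "a2"), (2, 1, "b"), (2, 2, "b2")]
    conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)
    return conn


def naive_count(conn: sqlite3.Connection) -> int:
    """Count every stored row, including copies that have not been merged away."""
    return conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]


def latest_rows(conn: sqlite3.Connection) -> list:
    """Deduplicate at query time: keep only the highest version for each id."""
    query = """
        SELECT e.id, e.version, e.value
        FROM events AS e
        JOIN (
            SELECT id, MAX(version) AS version
            FROM events
            GROUP BY id
        ) AS latest
          ON e.id = latest.id AND e.version = latest.version
        ORDER BY e.id
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    conn = build_demo()
    print(naive_count(conn))   # counts duplicates too
    print(latest_rows(conn))   # one row per id
```

The naive count sees every stored copy, while the aggregated query collapses each key to its newest version, which is the effect the comment attributes to aggregating instead of using FINAL.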
adren123, over 1 year ago
An initial import of all 15 files with DuckDB takes only 36 seconds on a regular (6-year-old) desktop computer with 32 GB of RAM, and 26 seconds (5 times quicker than QuestDB) on a Dell PowerEdge 450 with a 20-core Intel Xeon and 256 GB of RAM.

Here is the command to import the files:

CREATE TABLE ecommerce_sample AS SELECT * FROM read_csv_auto('ecommerce_*.csv');
whalesalad, over 1 year ago
Can anyone comment on QuestDB vs ClickHouse vs TimescaleDB? Real-world experience around ergonomics, ops, etc.

Currently using BigQuery for a lot of this (ingesting ~5-10 TB monthly) but would like to begin exploring in-house tooling.

On the flip side, we still use PSQL/RDS a lot and I enjoy it for the low operations burden, but we're doing some time series stuff with it now that is starting to fall over. TimescaleDB is nice because it *is* Postgres, but afaik it cannot work inside RDS. ClickHouse is next on my list for a test deployment, but QuestDB looks pretty neat too.
jimsimmons, over 1 year ago
What is the best way to deduplicate a corpus of documents?