Parallel Grouped Aggregation in DuckDB

90 points by hfmuehleisen about 3 years ago

3 comments

RobinL about 3 years ago
This is great. In terms of real-world uses, I'm currently working on enabling DuckDB as a backend in Splink[1], software for record linkage at scale. Central to the software is an iterative algorithm (Expectation Maximisation) that performs a large number of group-by aggregations on large tables.

Until recently, it was PySpark only, but we've found DuckDB gives us great performance on medium size data. This will be enabled in a forthcoming release (we have an early pre-release demo of duckdb backend[2]). This new DuckDB backend will probably be fast enough for the majority of our users, who don't have massive datasets.

With this in mind, excited to hear that:

> Another large area of future work is to make our aggregate hash table work with out-of-core operations, where an individual hash table no longer fits in memory, this is particularly problematic when merging.

This would be an amazing addition. Our users typically need to process sensitive data, and spinning up Spark can be a challenge from an infrastructure perspective. I'm imagining as we go forwards, more and more will be possible on a single beefy machine that is easily spun up in the cloud.

Anyway, really just wanted to say thanks to the DuckDB team for great work - you're enabling a lot of value downstream!

[1] https://github.com/moj-analytical-services/splink
[2] https://github.com/moj-analytical-services/splink_demos/tree/splink3_demos
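
For readers unfamiliar with the merge step mentioned in the quoted passage, here is a minimal sketch of a two-phase grouped aggregation: each worker builds its own local hash table over a chunk of rows, and the local tables are then merged into one. This is not DuckDB's implementation; the function names (`aggregate_chunk`, `merge_tables`, `parallel_group_by`) and the sample rows are made up for illustration, and Python threads are used only to show the structure, not to get real CPU parallelism.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def aggregate_chunk(rows):
    """Phase 1: build a local hash table mapping group key -> [sum, count]."""
    table = defaultdict(lambda: [0.0, 0])
    for key, value in rows:
        entry = table[key]
        entry[0] += value
        entry[1] += 1
    return table


def merge_tables(tables):
    """Phase 2: combine the per-worker tables into a single result table."""
    merged = defaultdict(lambda: [0.0, 0])
    for table in tables:
        for key, (total, count) in table.items():
            merged[key][0] += total
            merged[key][1] += count
    return dict(merged)


def parallel_group_by(rows, workers=4):
    """Split rows into chunks, aggregate each chunk independently, then merge."""
    chunk_size = max(1, -(-len(rows) // workers))  # ceiling division
    chunks = [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        local_tables = list(pool.map(aggregate_chunk, chunks))
    return merge_tables(local_tables)


# Hypothetical sample data: (group key, value) pairs.
rows = [("a", 1.0), ("b", 2.0), ("a", 3.0), ("c", 4.0), ("b", 5.0)]
result = parallel_group_by(rows, workers=2)
```

In this toy version the merge holds every local table in memory at once, which is the situation the quoted passage describes as problematic once a table no longer fits in memory.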
mavam about 3 years ago
I had a chat with Hannes, the DuckDB co-founder, a few weeks ago. They are building awesome stuff to become the "SQLite of OLAP". The team comes with a strong academic background and is tuned into the data engineering world.

At Tenzir, we looked at DuckDB as an embeddable backend engine to do the heavy lifting of query execution for our engine [1]. Our idea is to throw over a set of Parquet files, along with a query; initially SQL but perhaps soon Substrait [2] if it picks up.

We also experiment with a cloud deployment [3] where a different set of I/O paths may warrant a different backend engine. Right now, we're working on a serverless approach leveraging DataFusion (and depending on maturity, Ballista at some point).

My hunch is that we will see more pluggability in this space moving forward. It's not only meaningful from an open-core business model perspective, but also pays dividends to the UX. The company that's solving a domain problem (for us: security operations center infrastructure) can leverage a high-bandwidth drop-in engine and only needs to wire it properly. This requires far fewer data engineers than building a poor man's version of the same in-house.

We also have the R use case, e.g., to write reports in R Markdown that crunch some customer security telemetry, highlighting outliers or other noteworthy events. We're not there yet, but with the right query backend, I would expect to get this almost for free. We're close to being ready to use Arrow Flight for interop, but it's not zero-copy. DuckDB has demonstrated the zero-copy approach recently [4], going through the C API. (The story is also relevant when doing s/R/Python/, FWIW.)

[1] https://github.com/tenzir/vast
[2] https://github.com/substrait-io/substrait
[3] https://github.com/tenzir/vast/tree/master/cloud/aws
[4] https://duckdb.org/2021/12/03/duck-arrow.html
keewee7 about 3 years ago
Polars+DuckDB beats Pandas.

Tidyverse (R) is superior for data exploration, but R is not fun to deploy or to build complex multi-job data pipelines with.