Parallel Grouped Aggregation in DuckDB

90 points by hfmuehleisen about 3 years ago

3 comments

RobinL about 3 years ago
This is great. In terms of real-world uses, I'm currently working on enabling DuckDB as a backend in Splink [1], software for record linkage at scale. Central to the software is an iterative algorithm (Expectation Maximisation) that performs a large number of group-by aggregations on large tables.

Until recently, it was PySpark only, but we've found DuckDB gives us great performance on medium-sized data. This will be enabled in a forthcoming release (we have an early pre-release demo of the DuckDB backend [2]). This new DuckDB backend will probably be fast enough for the majority of our users, who don't have massive datasets.

With this in mind, excited to hear that:

> Another large area of future work is to make our aggregate hash table work with out-of-core operations, where an individual hash table no longer fits in memory, this is particularly problematic when merging.

This would be an amazing addition. Our users typically need to process sensitive data, and spinning up Spark can be a challenge from an infrastructure perspective. I'm imagining that, as we go forwards, more and more will be possible on a single beefy machine that is easily spun up in the cloud.

Anyway, really just wanted to say thanks to the DuckDB team for great work - you're enabling a lot of value downstream!

[1] https://github.com/moj-analytical-services/splink
[2] https://github.com/moj-analytical-services/splink_demos/tree/splink3_demos
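For readers unfamiliar with the pattern the quoted passage refers to, here is a rough, editor-added sketch of a parallel group-by: each worker builds its own local hash table over a chunk of rows, and the partial tables are then merged. It is plain Python, not DuckDB's implementation; the function names (`aggregate_chunk`, `merge_tables`, `parallel_group_by`) and the sample rows are hypothetical and chosen only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def aggregate_chunk(rows):
    """Build one local hash table: group key -> (count, sum)."""
    table = {}
    for key, value in rows:
        count, total = table.get(key, (0, 0.0))
        table[key] = (count + 1, total + value)
    return table


def merge_tables(tables):
    """Merge per-worker hash tables by adding counts and sums per key."""
    merged = {}
    for table in tables:
        for key, (count, total) in table.items():
            merged_count, merged_total = merged.get(key, (0, 0.0))
            merged[key] = (merged_count + count, merged_total + total)
    return merged


def parallel_group_by(rows, n_workers=4):
    """Split rows into chunks, aggregate each chunk, then merge the results."""
    chunk_size = max(1, -(-len(rows) // n_workers))  # ceiling division
    chunks = [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(aggregate_chunk, chunks))
    return merge_tables(partials)


# Hypothetical (group key, value) rows.
rows = [("a", 1.0), ("b", 2.0), ("a", 3.0), ("c", 4.0), ("b", 5.0), ("a", 6.0)]
result = parallel_group_by(rows, n_workers=3)
means = {key: total / count for key, (count, total) in result.items()}
```

The merge step is where memory pressure shows up in this sketch: the merged table must hold every distinct group key at once, which is presumably why tables that no longer fit in memory are called out as problematic when merging.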
mavam about 3 years ago
I had a chat with Hannes, the DuckDB co-founder, a few weeks ago. They are building awesome stuff to become the "SQLite of OLAP". The team comes with a strong academic background and is tuned into the data engineering world.

At Tenzir, we looked at DuckDB as an embeddable backend engine to do the heavy lifting of query execution for our engine [1]. Our idea is to hand it a set of Parquet files along with a query; initially SQL, but perhaps soon Substrait [2] if it picks up.

We are also experimenting with a cloud deployment [3], where a different set of I/O paths may warrant a different backend engine. Right now, we're working on a serverless approach leveraging DataFusion (and, depending on maturity, Ballista at some point).

My hunch is that we will see more pluggability in this space moving forward. It's not only meaningful from an open-core business model perspective, but also pays dividends for the UX. The company that's solving a domain problem (for us: security operations center infrastructure) can leverage a high-bandwidth drop-in engine and only needs to wire it up properly. This requires far fewer data engineers than building a poor man's version of the same in-house.

We also have the R use case, e.g., writing reports in R Markdown that crunch some customer security telemetry, highlighting outliers or other noteworthy events. We're not there yet, but with the right query backend, I would expect to get this almost for free. We're close to being ready to use Arrow Flight for interop, but it's not zero-copy. DuckDB has demonstrated the zero-copy approach recently [4], going through the C API. (The story is also relevant when doing s/R/Python/, FWIW.)

[1] https://github.com/tenzir/vast
[2] https://github.com/substrait-io/substrait
[3] https://github.com/tenzir/vast/tree/master/cloud/aws
[4] https://duckdb.org/2021/12/03/duck-arrow.html
keewee7 about 3 years ago
Polars + DuckDB beats Pandas.

Tidyverse (R) is superior for data exploration, but R is not fun to deploy or to build complex multi-job data pipelines with.