This is great. In terms of real-world uses, I'm currently working on enabling DuckDB as a backend in Splink [1], software for record linkage at scale. Central to the software is an iterative algorithm (Expectation Maximisation) that performs a large number of group-by aggregations on large tables; there's a rough sketch of the pattern below.

Until recently, the software was PySpark-only, but we've found that DuckDB gives us great performance on medium-sized data. This will be enabled in a forthcoming release (we have an early pre-release demo of the DuckDB backend [2]). The new DuckDB backend will probably be fast enough for the majority of our users, who don't have massive datasets.

With this in mind, I'm excited to hear that:
> Another large area of future work is to make our aggregate hash table work with out-of-core operations, where an individual hash table no longer fits in memory, this is particularly problematic when merging.

This would be an amazing addition. Our users typically need to process sensitive data, and spinning up Spark can be a challenge from an infrastructure perspective. I imagine that, going forward, more and more will be possible on a single beefy machine that can easily be spun up in the cloud.

Anyway, I really just wanted to say thanks to the DuckDB team for the great work - you're enabling a lot of value downstream!

[1] https://github.com/moj-analytical-services/splink
[2] https://github.com/moj-analytical-services/splink_demos/tree/splink3_demos
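To make the workload concrete, here's a minimal sketch of the access pattern described above: an EM-style loop that re-estimates parameters via repeated group-by aggregations over a large comparison table. The schema, column names, and parameter update rule are all hypothetical simplifications for illustration - this is not Splink's actual API - but the DuckDB Python calls are real.

```python
import duckdb

con = duckdb.connect()  # in-memory database

# Hypothetical comparison table: one row per candidate record pair,
# a discrete similarity level per column, and a current match probability.
con.execute("""
    CREATE TABLE comparisons AS
    SELECT * FROM (VALUES
        (1, 1, 0.9), (1, 0, 0.9), (0, 1, 0.1),
        (0, 0, 0.1), (1, 1, 0.8), (0, 0, 0.2)
    ) AS t(name_match, dob_match, match_probability)
""")

# Each EM iteration boils down to a handful of group-by aggregations:
# re-estimate the weight of each comparison level, weighted by the
# current match probabilities.
for _ in range(5):
    m_name = con.execute("""
        SELECT name_match,
               SUM(match_probability)
                   / SUM(SUM(match_probability)) OVER () AS m_estimate
        FROM comparisons
        GROUP BY name_match
    """).fetchall()
    # ... update match_probability from the new estimates, check
    # convergence, and repeat - one group-by per comparison column.
```

With many comparison columns and millions of pairs, this becomes a large number of group-by aggregations per iteration, which is why the aggregation performance discussed in the article matters so much here.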
I had a chat with Hannes, the DuckDB co-founder, a few weeks ago. They are building awesome stuff to become the "SQLite of OLAP". The team comes from a strong academic background and is tuned into the data engineering world.

At Tenzir, we looked at DuckDB as an embeddable backend engine to do the heavy lifting of query execution for our engine [1]. Our idea is to throw a set of Parquet files at it, along with a query; initially SQL, but perhaps soon Substrait [2] if it picks up.

We are also experimenting with a cloud deployment [3], where a different set of I/O paths may warrant a different backend engine. Right now, we're working on a serverless approach leveraging DataFusion (and, depending on maturity, Ballista at some point).

My hunch is that we will see more pluggability in this space moving forward. It's not only meaningful from an open-core business model perspective, but it also pays dividends for the UX. A company solving a domain problem (for us: security operations center infrastructure) can leverage a high-bandwidth drop-in engine and only needs to wire it up properly. This requires far fewer data engineers than building a poor man's version of the same thing in-house.

We also have the R use case, e.g., writing reports in R Markdown that crunch some customer security telemetry, highlighting outliers or other noteworthy events. We're not there yet, but with the right query backend, I would expect to get this almost for free. We're close to being ready to use Arrow Flight for interop, but it's not zero-copy. DuckDB demonstrated the zero-copy approach recently [4], going through the C API; a sketch of what that looks like from Python is below. (The story is also relevant when doing s/R/Python/, FWIW.)

[1] https://github.com/tenzir/vast
[2] https://github.com/substrait-io/substrait
[3] https://github.com/tenzir/vast/tree/master/cloud/aws
[4] https://duckdb.org/2021/12/03/duck-arrow.html
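For reference, here is roughly what the "Parquet in, Arrow out" flow looks like from Python. The file path, column names, and query are made-up placeholders; the read_parquet scan, the Arrow result fetch, and querying Arrow tables in place are the DuckDB features shown in [4].

```python
import duckdb

con = duckdb.connect()

# Parquet in: DuckDB scans the files directly, no staging step.
# ('telemetry/*.parquet' and the columns are hypothetical placeholders.)
events = con.execute("""
    SELECT src_ip, COUNT(*) AS hits
    FROM read_parquet('telemetry/*.parquet')
    GROUP BY src_ip
    ORDER BY hits DESC
""").arrow()  # Arrow out: the result comes back as a pyarrow Table

# Arrow in: an existing Arrow table can itself be queried; DuckDB picks
# up the local Python variable 'events' via a replacement scan.
suspicious = con.execute(
    "SELECT * FROM events WHERE hits > 1000"
).arrow()
```

This pattern is what would make the reporting use case nearly free: the engine does the crunching and hands the result to R or Python without a serialisation round-trip in between.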