TechEcho

3 comments

RobinL about 3 years ago

This is great. In terms of real-world uses, I'm currently working on enabling DuckDB as a backend in Splink [1], software for record linkage at scale. Central to the software is an iterative algorithm (Expectation Maximisation) that performs a large number of group-by aggregations on large tables.

Until recently, it was PySpark only, but we've found DuckDB gives us great performance on medium-sized data. This will be enabled in a forthcoming release (we have an early pre-release demo of the DuckDB backend [2]). This new DuckDB backend will probably be fast enough for the majority of our users, who don't have massive datasets.

With this in mind, I'm excited to hear that:

> Another large area of future work is to make our aggregate hash table work with out-of-core operations, where an individual hash table no longer fits in memory, this is particularly problematic when merging.

This would be an amazing addition. Our users typically need to process sensitive data, and spinning up Spark can be a challenge from an infrastructure perspective. I'm imagining that, going forward, more and more will be possible on a single beefy machine that is easily spun up in the cloud.

Anyway, really just wanted to say thanks to the DuckDB team for great work - you're enabling a lot of value downstream!

[1] https://github.com/moj-analytical-services/splink
[2] https://github.com/moj-analytical-services/splink_demos/tree/splink3_demos
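
For readers unfamiliar with the idea behind the quoted passage, here is a toy sketch of grouped aggregation with one hash table per worker followed by a merge step. It is not DuckDB's implementation: it uses plain Python dictionaries and threads, everything fits in memory, and the function names (`aggregate_chunk`, `merge_tables`, `parallel_group_by`) are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def aggregate_chunk(rows):
    """Build a local hash table mapping group key -> (count, sum) for one chunk."""
    table = {}
    for key, value in rows:
        count, total = table.get(key, (0, 0))
        table[key] = (count + 1, total + value)
    return table


def merge_tables(tables):
    """Combine the per-worker hash tables into a single result table."""
    merged = {}
    for table in tables:
        for key, (count, total) in table.items():
            merged_count, merged_total = merged.get(key, (0, 0))
            merged[key] = (merged_count + count, merged_total + total)
    return merged


def parallel_group_by(rows, workers=4):
    """Hypothetical helper: split rows into chunks, aggregate each, then merge."""
    # Ceiling division so that every row lands in exactly one chunk.
    chunk_size = max(1, -(-len(rows) // workers))
    chunks = [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(aggregate_chunk, chunks))
    return merge_tables(partials)


if __name__ == "__main__":
    data = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
    print(parallel_group_by(data, workers=2))
```

The merge step is where memory pressure shows up in this sketch: the merged table has to hold every distinct group at once, which is the situation the quoted out-of-core work is aimed at.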

mavam about 3 years ago

I had a chat with Hannes, the DuckDB co-founder, a few weeks ago. They are building awesome stuff to become the "SQLite of OLAP". The team comes with a strong academic background and is tuned into the data engineering world.

At Tenzir, we looked at DuckDB as an embeddable backend engine to do the heavy lifting of query execution for our engine [1]. Our idea is to throw over a set of Parquet files along with a query; initially SQL, but perhaps soon Substrait [2] if it picks up.

We also experiment with a cloud deployment [3], where a different set of I/O paths may warrant a different backend engine. Right now, we're working on a serverless approach leveraging DataFusion (and, depending on maturity, Ballista at some point).

My hunch is that we will see more pluggability in this space moving forward. It's not only meaningful from an open-core business model perspective, but also pays dividends for the UX. The company that's solving a domain problem (for us: security operations center infrastructure) can leverage a high-bandwidth drop-in engine and only needs to wire it up properly. This requires far fewer data engineers than building a poor man's version of the same in-house.

We also have the R use case, e.g., to write reports in Rmarkdown that crunch some customer security telemetry, highlighting outliers or other noteworthy events. We're not there yet, but with the right query backend, I would expect to get this almost for free. We're close to being ready to use Arrow Flight for interop, but it's not zero-copy. DuckDB has demonstrated the zero-copy approach recently [4], going through the C API. (The story is also relevant when doing s/R/Python/, FWIW.)

[1] https://github.com/tenzir/vast
[2] https://github.com/substrait-io/substrait
[3] https://github.com/tenzir/vast/tree/master/cloud/aws
[4] https://duckdb.org/2021/12/03/duck-arrow.html

keewee7 about 3 years ago

Polars+DuckDB beats Pandas.

Tidyverse (R) is superior for data exploration, but R is not fun to deploy or to build complex multi-job data pipelines with.

Parallel Grouped Aggregation in DuckDB
