科技回声

4 comments

wenc, over 3 years ago

Props to the DuckDB team. I've been using DuckDB for the last little while since I discovered it on HN and it's been simply amazing.

Before, I would reach for Apache Spark to run queries on local Parquet datasets (on a single machine), but I've started using DuckDB for that and it's super fast (much faster than Pandas) and unfussy to integrate in Python code (with PySpark you need all kinds of boilerplate code).

DuckDB is so lightweight that it's also great for quick interactive work in Jupyter or IPython.

I also use it to do cross-format joins between Parquet (immutable) and CSV files (mutable) -- DuckDB can load both into the same environment -- which makes it easy to solve different kinds of programming problems. The dynamic stuff goes into the CSV files while the Parquet dataset remains static. For instance, when processing a large Parquet dataset, my code keeps track of groups that I've already processed in a CSV file. If the program is interrupted and I need to resume from where I left off, I just do a DuckDB join between Parquet and CSV and exclude already processed groups (this is particularly handy when you don't have a single group key and you're grouping by several fields). Yes, you can do all this with Spark too, but the DuckDB code is so much simpler and more compact.

For large Parquet datasets, I currently roll my own code to chunk them so I can process them out-of-core, but it sounds like this latest streaming feature in DuckDB takes care of that detail.

Sure, DuckDB doesn't do distributed compute like Spark, but as a SQL engine for Parquet, I find it's so much more ergonomic than Spark.
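For anyone curious what the resume-from-CSV trick might look like, here is a rough sketch, not the commenter's actual code. It assumes the `duckdb` Python package is installed; the file names and the key columns `region` and `day` are made up for illustration, and the key columns are assumed to contain no NULLs. The idea is an anti-join: keep only the key combinations in the Parquet data that have no matching row in the CSV progress log.

```python
def pending_groups_sql(parquet_path, csv_path, keys):
    """Build an anti-join query: key combinations that appear in the
    Parquet data but are not yet recorded in the CSV progress log.

    Paths and column names are interpolated directly into the SQL, so
    this sketch is only suitable for trusted, local inputs.
    """
    cols = ", ".join(f"p.{k}" for k in keys)
    # Match on every grouping field (assumes the keys are never NULL).
    match = " AND ".join(f"p.{k} = done.{k}" for k in keys)
    return (
        f"SELECT DISTINCT {cols} "
        f"FROM read_parquet('{parquet_path}') AS p "
        f"WHERE NOT EXISTS ("
        f"SELECT 1 FROM read_csv_auto('{csv_path}') AS done WHERE {match})"
    )


def pending_groups(parquet_path, csv_path, keys):
    """Run the query in an in-memory DuckDB and return the remaining groups."""
    # Third-party dependency, imported here so the SQL builder above
    # can be used and inspected without DuckDB installed.
    import duckdb

    con = duckdb.connect()  # in-memory database
    return con.execute(pending_groups_sql(parquet_path, csv_path, keys)).fetchall()
```

Something like `pending_groups("events.parquet", "processed.csv", ["region", "day"])` would then return the `(region, day)` tuples still left to process; after finishing each group, the program appends a row for it to `processed.csv`.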

polskibus, over 3 years ago

How does this compare to Postgres + parquet FDW? Is zero copy feasible in Postgres with FDWs?

waynesonfire, over 3 years ago

"In-process, serverless" lol.. who's drinking this..

ekzhu, over 3 years ago

TLDR: Arrow got an SQL interface provided by DuckDB. So you have a new way to run SQL on Parquet et al. through DuckDB -> Arrow -> Parquet. Of course, you still need to watch out for the memory usage of your SQL query if it contains JOINs or window functions, because the integration is designed for streaming rows.

DuckDB quacks Arrow: A zero-copy data integration between Arrow and DuckDB
