DuckDB quacks Arrow: A zero-copy data integration between Arrow and DuckDB

96 points by hfmuehleisen, over 3 years ago

4 comments

wenc, over 3 years ago
Props to the DuckDB team. I've been using DuckDB for the last little while since I discovered it on HN and it's been simply amazing.

Before, I would reach for Apache Spark to run queries on local Parquet datasets (on a single machine), but I've started using DuckDB for that and it's super fast (much faster than Pandas) and unfussy to integrate in Python code (with PySpark you need all kinds of boilerplate code).

DuckDB is so lightweight that it's also great for quick interactive work in Jupyter or IPython.

I also use it to do cross-format joins between Parquet (immutable) and CSV files (mutable). DuckDB can load both into the same environment, which makes it easy to solve different kinds of programming problems. The dynamic stuff goes into the CSV files while the Parquet dataset remains static. For instance, when processing a large Parquet dataset, my code keeps track of groups that I've already processed in a CSV file. If the program is interrupted and I need to resume from where I left off, I just do a DuckDB join between Parquet and CSV and exclude already processed groups (particularly useful when you don't have a single group key and you're grouping by several fields). Yes, you can do all this with Spark too, but the DuckDB code is so much simpler and more compact.

For large Parquet datasets, I currently roll my own code to chunk them so I can process them out-of-core, but it sounds like this latest streaming feature in DuckDB takes care of that detail.

Sure, DuckDB doesn't do distributed compute like Spark, but as a SQL engine for Parquet, I find it's so much more ergonomic than Spark.
polskibus, over 3 years ago
How does this compare to Postgres + parquet FDW? Is zero copy feasible in Postgres with FDWs?
waynesonfire, over 3 years ago
"In-process, serverless" lol.. who's drinking this..
ekzhu, over 3 years ago
TLDR: Arrow got an SQL interface provided by DuckDB.

So you have a new way to run SQL on Parquet et al. through DuckDB -> Arrow -> Parquet. Of course, you still need to watch out for the memory usage of your SQL query if it contains JOINs or window functions, because the integration is designed for streaming rows.