DuckDB over Pandas/Polars

56 points by pgr0ss, 7 months ago

9 comments

lopatin, 7 months ago
I think the competition for the future is between DuckDB and Polars. Will we stick with the DataFrame model, made feasible by Polars's lazy execution, or will we go with in-process SQL à la DuckDB? Personally I've been using DuckDB because I already know SQL (and DuckDB provides persistence if I need it) and don't want to learn a new DataFrame DSL, but I'd love to hear about other people's experience.
ramraj07, 7 months ago
I am just using DuckDB on a 3TB dataset on a beefy EC2 instance, and am pleasantly surprised at its performance on such a large table. I had to do some sharding, to be sure, but I am able to match the performance of Snowflake or other cluster-based systems using this single machine.

To clarify, ClickHouse will likely match this performance as well, but doing things on a single machine looks sexier to me than it has in decades.
minimaxir, 7 months ago
The test case of a simple aggregation is a good example of an important data science skill: knowing when and where to use a given tool, and that there is no one right answer for all cases. It's worth noting, though, that DuckDB and polars are comparable performance-wise for aggregation (DuckDB slightly faster: https://duckdblabs.github.io/db-benchmark/ ).

For my cases with polars and function piping, certain aspects of that workflow are hard to represent in SQL. It's also easier, when iterating on or testing a given aggregation, to add or remove a given function pipe, and to relate to existing tables (e.g. filter a table to only IDs present in a different table, which is more algorithmically efficient than a join-then-filter). Doing the ETL I tend to do for my data science work in SQL/DuckDB rather than pandas/polars would require chains of CTEs or other shenanigans, which eliminates simplicity and efficiency.
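The "filter to only IDs present in a different table" idea can be sketched in plain Python, without polars or DuckDB. This is only an illustration of the semi-join idea, not the commenter's actual code; the `orders` and `vip` tables and both helper functions are hypothetical names made up for the example.

```python
# Hypothetical toy tables, represented as lists of dicts.
orders = [
    {"id": 1, "customer_id": 10, "amount": 5.0},
    {"id": 2, "customer_id": 11, "amount": 7.5},
    {"id": 3, "customer_id": 10, "amount": 2.5},
    {"id": 4, "customer_id": 12, "amount": 1.0},
]
# Note the duplicate customer_id 12 in the lookup table.
vip = [{"customer_id": 10}, {"customer_id": 12}, {"customer_id": 12}]


def semi_join(rows, other, key):
    """Keep rows whose key appears in `other`; one set lookup per row."""
    wanted = {r[key] for r in other}
    return [row for row in rows if row[key] in wanted]


def naive_join(rows, other, key):
    """Nested-loop inner join: emits one output row per matching pair."""
    return [{**row, **o} for row in rows for o in other if row[key] == o[key]]


semi = semi_join(orders, vip, "customer_id")
joined = naive_join(orders, vip, "customer_id")

semi_ids = [row["id"] for row in semi]      # each order appears at most once
joined_ids = [row["id"] for row in joined]  # order 4 is duplicated by the join
```

The semi-join never multiplies rows, whereas the join duplicates order 4 because its customer appears twice in `vip`, so a join-then-filter approach needs an extra deduplication step.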
wodenokoto, 7 months ago
> Note that DuckDB automatically figured out how to parse the date column.

It kinda did and it kinda didn't. The author got lucky that Transaction.csv contained a date where the day was after the 12th in a given month. Had there not been such a date, DuckDB would have gotten the dates wrong and read it as dd/mm/yyyy.

I think a warning from DuckDB would have been in order.
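The ambiguity described here can be reproduced with Python's standard library alone. The sketch below is not DuckDB's detection logic; `possible_formats` is a hypothetical helper and the sample dates are invented. It simply checks which of two candidate formats can parse every value in a column.

```python
from datetime import datetime


def possible_formats(values, formats=("%d/%m/%Y", "%m/%d/%Y")):
    """Return the candidate formats that successfully parse every value."""
    matching = []
    for fmt in formats:
        try:
            for value in values:
                datetime.strptime(value, fmt)
        except ValueError:
            continue  # at least one value does not fit this format
        matching.append(fmt)
    return matching


# Every day and month is <= 12, so both readings are valid.
ambiguous = ["01/02/2024", "03/04/2024", "12/11/2024"]
# One value with a day above 12 rules out the month-first reading.
unambiguous = ambiguous + ["25/12/2024"]

ambiguous_formats = possible_formats(ambiguous)
unambiguous_formats = possible_formats(unambiguous)
```

With only the first three dates, both formats fit and any automatic choice is a guess; a single date with a day above 12 settles it. Stating the date format explicitly when loading such a file avoids depending on what the data happens to contain.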
knowsuchagency, 7 months ago
Why not both? https://ibis-project.org/
wanderingmind, 7 months ago
My biggest issue with DuckDB is that it's not willing to implement edits for blob storage that allows edits (Azure). Having common object/blob storage that multiple processes can interact with and operate on would make it much more amenable to many data-science-driven workflows.
jgalt212, 7 months ago
At what database size does it make sense to move from SQLite to DuckDB? My use case is offline data analysis, not a query/response web app.
pietz, 7 months ago
I don't understand the purpose of this post. "I write a lot of X so I prefer using X over Y." Great.
xiaodai, 7 months ago
Lack of UDFs is an issue.