TechEcho
DuckDB over Pandas/Polars

56 points by pgr0ss 7 months ago

9 comments

lopatin 7 months ago
I think the competition for the future is between DuckDB and Polars. Will we stick with the DataFrame model, made feasible by Polars's lazy execution, or will we go with in-process SQL à la DuckDB? Personally, I've been using DuckDB because I already know SQL (and DuckDB provides persistence if I need it) and don't want to learn a new DataFrame DSL, but I'd love to hear about other people's experience.
ramraj07 7 months ago
I'm using DuckDB on a 3 TB dataset on a beefy EC2 instance and am pleasantly surprised by its performance on such a large table. I had to do some sharding, to be sure, but I'm able to match the performance of Snowflake or other cluster-based systems using this single machine.
To clarify, ClickHouse will likely match this performance as well, but doing things on a single machine looks sexier to me than it has in decades.
minimaxir 7 months ago
The test case of a simple aggregation is a good example of an important data science skill: knowing when and where to use a given tool, and that there is no one right answer for all cases. It's worth noting, though, that DuckDB and polars are comparable performance-wise for aggregation (DuckDB slightly faster: https://duckdblabs.github.io/db-benchmark/ ).
For my cases with polars and function piping, certain aspects of that workflow are hard to represent in SQL. It's also easier, when iterating on or testing a given aggregation, to add or remove a function pipe, and to relate to existing tables (e.g. filter a table to only IDs present in a different table, which is more algorithmically efficient than a join-then-filter). Doing the ETL I tend to do for my data science work in pandas/polars in SQL/DuckDB instead would require chains of CTEs or other shenanigans, which eliminates the simplicity and efficiency.
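The "filter a table to only IDs present in a different table" operation mentioned above is a semi-join. Here is a minimal sketch of how it differs from a plain join, using Python's built-in sqlite3 module as a stand-in engine so the example runs without extra dependencies. The table names, column names, and data are hypothetical and not from the original post.

```python
import sqlite3

# Hypothetical tables: `orders` is the table to filter,
# `vip` holds the IDs to keep (note the duplicate ID 3).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer_id INTEGER, amount REAL);
    CREATE TABLE vip (customer_id INTEGER);
    INSERT INTO orders VALUES (1, 10.0), (2, 20.0), (1, 5.0), (3, 7.5);
    INSERT INTO vip VALUES (1), (3), (3);
""")

# Semi-join: keep order rows whose customer_id appears in vip.
# Each order row is returned at most once, whatever vip contains.
semi_join_rows = conn.execute("""
    SELECT customer_id, amount
    FROM orders
    WHERE customer_id IN (SELECT customer_id FROM vip)
    ORDER BY customer_id, amount
""").fetchall()

# Plain inner join: the duplicate vip row multiplies the matching order row,
# so an extra de-duplication or filter step would be needed afterwards.
join_rows = conn.execute("""
    SELECT o.customer_id, o.amount
    FROM orders AS o
    JOIN vip AS v ON o.customer_id = v.customer_id
    ORDER BY o.customer_id, o.amount
""").fetchall()

print(semi_join_rows)
print(len(join_rows))
```

The semi-join only has to answer "does this ID exist on the other side?", while the join materializes every matching pair first. How much that matters in practice depends on the engine's optimizer.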
wodenokoto 7 months ago
> Note that DuckDB automatically figured out how to parse the date column.
It kinda did and it kinda didn't. Author got lucky that Transaction.csv contained a date where the day was after the 12th in a given month. Had there not been such a date, DuckDB would have gotten the dates wrong and read it as dd/mm/yyyy.
I think a warning from DuckDB would have been in order.
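To illustrate why a single day value above 12 settles the question, here is a small sketch in plain Python. It is not DuckDB's detection logic; the two candidate formats and the sample values are assumptions chosen for the example. It tries each format against every value and reports all formats that fit, so a caller could warn when more than one survives.

```python
from datetime import datetime

# Candidate formats are an assumption for this sketch.
CANDIDATE_FORMATS = ("%d/%m/%Y", "%m/%d/%Y")


def matching_formats(values):
    """Return every candidate format that parses all of the given strings."""
    matches = []
    for fmt in CANDIDATE_FORMATS:
        try:
            for value in values:
                datetime.strptime(value, fmt)
        except ValueError:
            # At least one value is invalid under this format; rule it out.
            continue
        matches.append(fmt)
    return matches


# Every day and month is <= 12, so both readings are valid: ambiguous.
ambiguous = matching_formats(["01/02/2024", "03/04/2024"])

# "25/04/2024" cannot be month 25, so only day-first survives.
unambiguous = matching_formats(["01/02/2024", "25/04/2024"])

if len(ambiguous) > 1:
    print("warning: date column is ambiguous:", ambiguous)
print(unambiguous)
```

When more than one format survives, passing the format explicitly is safer than relying on whichever one a tool happens to pick.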
knowsuchagency 7 months ago
Why not both? https://ibis-project.org/
wanderingmind 7 months ago
My biggest issue with DuckDB is that it's not willing to implement edits for blob storage that allows edits (Azure). Having common object/blob storage that can be interacted with and operated on by multiple processes would make it much more amenable to many data-science-driven workflows.
jgalt212 7 months ago
At what database size does it make sense to move from SQLite to DuckDB? My use case is off-line data analysis, not a query/response web app.
pietz 7 months ago
I don't understand the purpose of this post. "I write a lot of X so I prefer using X over Y." Great.
xiaodai 7 months ago
Lack of UDFs is an issue.