科技回声

9 comments

lopatin, 7 months ago

I think the competition for the future is between DuckDB and Polars. Will we stick with the DataFrame model, made feasible by Polars's lazy execution, or will we go with in-process SQL à la DuckDB? Personally I've been using DuckDB because I already know SQL (and DuckDB provides persistence if I need it) and don't want to learn a new DataFrame DSL, but I'd love to hear about other people's experience.

ramraj07, 7 months ago

I am just using DuckDB on a 3 TB dataset on a beefy EC2 instance, and am pleasantly surprised at its performance on such a large table. I had to do some sharding, to be sure, but I am able to match the performance of Snowflake or other cluster-based systems using this single machine.

To clarify, ClickHouse will likely match this performance as well, but doing things on a single machine looks sexier to me than it ever did in decades.

minimaxir, 7 months ago

The test case of a simple aggregation is a good example of an important data science skill: knowing when and where to use a given tool, and that there is no one right answer for all cases. It's worth noting, though, that DuckDB and Polars are comparable performance-wise for aggregation (DuckDB is slightly faster: https://duckdblabs.github.io/db-benchmark/).

For my use cases with Polars and function piping, certain aspects of that workflow are hard to represent in SQL. It's also easier, when iterating on or testing a given aggregation, to add or remove a function pipe, and to relate to existing tables (e.g. filter a table to only the IDs present in a different table, which is more algorithmically efficient than a join-then-filter). Doing the ETL I tend to do for my data science work in pandas/Polars in SQL/DuckDB instead would require chains of CTEs or other shenanigans, which eliminates simplicity and efficiency.
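For readers unfamiliar with the "filter to IDs present in another table" pattern, here is a minimal sketch of how it differs from a plain join. It uses Python's built-in `sqlite3` purely as a stand-in so it runs without extra installs; the table names `orders` and `flagged` and their contents are made up for illustration. The queries are plain SQL, and Polars expresses the same idea with a semi join (`how="semi"`).

```python
import sqlite3

# In-memory database; 'orders' and 'flagged' are hypothetical tables.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE orders (id INTEGER, amount REAL);
    CREATE TABLE flagged (id INTEGER);
    INSERT INTO orders VALUES (1, 10.0), (2, 20.0), (3, 30.0);
    INSERT INTO flagged VALUES (2), (2), (3);
""")

# Semi-join style filter: keep each order whose id appears in 'flagged'.
semi = con.execute(
    "SELECT id, amount FROM orders "
    "WHERE id IN (SELECT id FROM flagged) ORDER BY id"
).fetchall()

# Plain inner join: one output row per matching pair, so the
# duplicated id 2 in 'flagged' duplicates the order row as well.
joined = con.execute(
    "SELECT o.id, o.amount FROM orders o "
    "JOIN flagged f ON o.id = f.id ORDER BY o.id"
).fetchall()

print(semi)    # [(2, 20.0), (3, 30.0)]
print(joined)  # [(2, 20.0), (2, 20.0), (3, 30.0)]
```

The filter form only has to answer "does this ID exist on the other side?", whereas the join materializes every matching pair and then needs a further step to drop the extras.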

wodenokoto, 7 months ago

> Note that DuckDB automatically figured out how to parse the date column.

It kinda did and it kinda didn't. The author got lucky that Transaction.csv contained a date where the day was after the 12th in a given month. Had there not been such a date, DuckDB would have gotten the dates wrong and read them as dd/mm/yyyy.

I think a warning from DuckDB would have been in order.
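The ambiguity described here is easy to reproduce without DuckDB at all. The sketch below uses only Python's standard `datetime` module and is not how DuckDB's sniffer is implemented; the helper name `possible_formats` and the sample dates are invented for illustration. It checks which day/month orders can parse every value in a column.

```python
from datetime import datetime

def possible_formats(samples):
    """Return the day/month orders that can parse every sample string."""
    candidates = {"dd/mm/yyyy": "%d/%m/%Y", "mm/dd/yyyy": "%m/%d/%Y"}
    matches = []
    for label, fmt in candidates.items():
        try:
            for s in samples:
                datetime.strptime(s, fmt)
        except ValueError:
            continue  # at least one value does not fit this order
        matches.append(label)
    return matches

# Every day and month is <= 12, so both orders parse cleanly.
ambiguous = ["01/02/2024", "03/04/2024"]
# One value with a 25 in second position rules out dd/mm/yyyy.
decisive = ambiguous + ["02/25/2024"]

print(possible_formats(ambiguous))  # ['dd/mm/yyyy', 'mm/dd/yyyy']
print(possible_formats(decisive))   # ['mm/dd/yyyy']
```

When more than one format survives, any automatic choice is a guess, which is the case where a warning (or an explicitly specified date format) would help.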

knowsuchagency, 7 months ago

Why not both? https://ibis-project.org/

wanderingmind, 7 months ago

My biggest issue with DuckDB is its unwillingness to implement edits for blob storage that allows edits (Azure). Having common object/blob storage that multiple processes can interact with and operate on would make it much more amenable to many data-science-driven workflows.

jgalt212, 7 months ago

At what database size does it make sense to move from SQLite to DuckDB? My use case is offline data analysis, not a query/response web app.

pietz, 7 months ago

I don't understand the purpose of this post. "I write a lot of X so I prefer using X over Y." Great.

xiaodai, 7 months ago

Lack of UDFs is an issue.

DuckDB over Pandas/Polars