I think these benchmarks are great, but also quite misleading, and they should be updated:

* The 1 billion row benchmarks are run on a single, uncompressed 50 GB CSV file. A dataset that size should be stored in multiple files.

* The benchmarks only show the query runtime once the data has been persisted in memory. They should also show how long it takes to persist the data in memory. If query_engine_A takes 5 minutes to persist in memory and 10 seconds to run the query, while query_engine_B takes 2 minutes to persist and 20 seconds to run the query, then the time to persist the data is highly relevant. (See the timing sketch below.)

* The benchmarks should also show results when the data isn't persisted in memory.

* Using a Parquet file with column pruning would make a lot more sense than a huge CSV file. The groupby dataset has 9 columns and some of the queries only require 3 of them. Needlessly persisting the other 6 columns in memory is really misleading for some engines. (See the column-pruning sketch below.)

* It seems like some of the engines have queries that are more optimized than others. Some explicitly cast columns to int32 while others presumably use int64. The queries should be apples-to-apples across engines.

* Some engines are parallel and lazy. "Running" some of these queries is hard because lazy engines don't want to do work unless they have to. The authors have forced some of these queries to run by persisting in memory, which is another step, so that should be investigated. (See the lazy-evaluation sketch below.)

* There are obvious missing query types, like filtering, and "compound queries" that filter, join, then aggregate. (See the compound-query sketch below.)

I like these benchmarks a lot and use the h2o datasets locally all the time, but the methodology really needs to be modernized. At the bottom you can see "Benchmark run took around 105.3 hours." That is way too slow, and there are some obvious fixes that would make the results more useful for the data community.
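
To make the persist-vs-query split concrete, here's a minimal timing sketch in pandas. The file path is a hypothetical stand-in for the h2o groupby data; id1/v1 match the groupby dataset's schema. The point is simply that both numbers get reported:

```python
import time

import pandas as pd

# Phase 1: persist the data in memory. This cost should be reported,
# not hidden, because it dominates total time for some engines.
start = time.perf_counter()
df = pd.read_csv("groupby_1e9.csv")  # hypothetical path to the groupby data
persist_seconds = time.perf_counter() - start

# Phase 2: run the actual query against the in-memory data.
start = time.perf_counter()
result = df.groupby("id1", as_index=False)["v1"].sum()
query_seconds = time.perf_counter() - start

# 5 min + 10 s vs 2 min + 20 s tells a very different story
# than 10 s vs 20 s alone.
print(f"persist: {persist_seconds:.1f}s, query: {query_seconds:.1f}s")
```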
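
On the Parquet point, here's a sketch of what column pruning buys you, again with a hypothetical groupby_1e9.parquet holding the same 9-column schema. A CSV reader has to scan every byte of all 9 columns; a Parquet reader can skip straight to the 3 the query touches:

```python
import pandas as pd

# Read only the columns this query needs; the other 6 are never
# loaded, let alone persisted in memory.
df = pd.read_parquet(
    "groupby_1e9.parquet",         # hypothetical Parquet copy of the dataset
    columns=["id1", "id2", "v1"],  # 3 of the 9 columns
)
result = df.groupby(["id1", "id2"], as_index=False)["v1"].sum()
```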
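
On the lazy-engine point: with something like Polars, nothing runs until you force materialization, so where you put the timer changes what you measure. A rough sketch (group_by is the spelling in recent Polars releases; the path is hypothetical):

```python
import time

import polars as pl

# Building the lazy query is essentially free; no data is read yet.
lazy = (
    pl.scan_csv("groupby_1e9.csv")  # hypothetical path
    .group_by("id1")
    .agg(pl.col("v1").sum())
)

# Only .collect() forces the scan and aggregation to actually run.
# Timing anything before this line measures query planning, not work.
start = time.perf_counter()
result = lazy.collect()
print(f"end-to-end: {time.perf_counter() - start:.1f}s")
```

Persisting in memory first, as the benchmark does, is one way to force the work, but it's an extra step that not every real workload would take.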
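
And here's the kind of "compound query" I'd like to see benchmarked, sketched in pandas. The file paths and column names are hypothetical stand-ins for the h2o join tables; the shape of the query is what matters:

```python
import pandas as pd

# Hypothetical stand-ins for a large fact table and a small dimension table.
facts = pd.read_parquet("join_big.parquet", columns=["id1", "v1"])
dims = pd.read_parquet("join_small.parquet", columns=["id1", "id4"])

# Filter, then join, then aggregate: closer to a real workload than
# any single-operation query.
result = (
    facts[facts["v1"] > 0]                        # filter
    .merge(dims, on="id1")                        # join
    .groupby("id4", as_index=False)["v1"].sum()   # aggregate
)
```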