
Comparing SQLite, DuckDB and Arrow with UN trade data

246 points · by marcle · over 3 years ago

19 comments

chrisjc · over 3 years ago

I'm a little confused with Arrow being included in the comparison.

What does Arrow have to do with Parquet? We are talking about the file format Parquet, right? Does Arrow use Parquet as its default data storage format?

But isn't Arrow a format too? As I understand it, Arrow is a format optimized for transferring in-memory data from one distributed system to another (ser-des), while also facilitating and optimizing certain set operations. From RAM in one system to RAM in another.

Moreover, since Arrow is a format, why is it being compared to databases like SQLite and DuckDB? If we're talking about formats, why not compare Arrow queries against Parquet data to DuckDB queries against Parquet data? https://duckdb.org/docs/data/parquet

Why not at least benchmark the query execution alone instead of startup and loading of data? For Arrow, isn't it assumed that there is an engine like Spark or Snowflake already up and running that's serving you data in the Arrow format? Ideally, with Arrow you should never be dealing with data starting in a resting format like Parquet. The data should already be in RAM to reap the benefits of Arrow. Its value proposition is that it'll get "live" data from point A to B as efficiently as possible, in an open, non-proprietary, ubiquitous (eventually) format.

Exactly what of SQLite, DuckDB and Arrow is being compared here?

I would assume the benefit of Arrow in R (or DataFrames in general) would be getting data from a data engine into your DataFrame runtime as efficiently as possible. (Just as interesting might be where and how push-downs are handled.)

Perhaps I'm missing the trees for the forest?

No disrespect to the author... Seems like they're on a quest for knowledge, and while the article is confusing to me, it certainly got me thinking.

Disclaimer: I don't read R too good, and I'm still struggling with what exactly Arrow is. (Comparisons like this actually leave me even more confused about what Arrow is.)
isoprophlex · over 3 years ago

The relevant results from the linked article:

    ## format            median_time  mem_alloc
    ## 1 R (RDS)          1.34m        4.08GB
    ## 2 SQL (SQLite)     5.48s        6.17MB
    ## 3 SQL (DuckDB)     1.76s        104.66KB
    ## 4 Arrow (Parquet)  1.36s        453.89MB

I'd bet that doing the same with Pandas would require time and space similar to RDS (1). I really hope DuckDB makes it in the Python world; everything I read about it seems very promising. Using it myself for toy projects was pleasant, too.
kristjansson · over 3 years ago

Lots of confusion derives from the best-of-breed Parquet readers for Python and R residing in the Arrow packages, mostly because Arrow is (and does) a lot of things. There's:

* arrow, an in-memory format for dataframe-shaped things supporting fast computation, zero-copy sharing, etc.

* arrow Feather V1, an on-disk format for dataframe-shaped things

* arrow IPC, a (de)serialization for arrow buffers and a protocol for sending/receiving to other processes

* arrow Feather V2, an on-disk format that's basically the IPC serialization written to a file [1]

* arrow Flight, a protocol for requesting/sending/receiving data to remote machines that's basically gRPC layered over the IPC format

* arrow DataFusion/Ballista, nascent system(s) for local/distributed query execution over arrow buffers

* other subprojects I'm surely forgetting

* a (very good) C++ Parquet reader [2][3] developed under the auspices of the project

* libraries in / bindings to many languages, including R and Python, supporting interaction with (subsets of) the above

It's only the last piece that's exposed to most data-science-y users, and thus identified with the 'arrow' name. Since those libraries are also very good, and hide their abstractions well, those users are free to use the functionality relevant to them — be it dealing with Parquet, Feather, etc. — without needing to understand how it works.

Not that this is a criticism of the project, or those users! Arrow encompasses lots of different functionality, which enables it to provide different things to different people. As a result, though, 'Arrow' connotes lots of different things (and different _sorts_ of things) to different users, which can cause some confusion if terms aren't fully specified, or are even a bit misunderstood.

[1] https://stackoverflow.com/a/67911190/881025
[2] https://github.com/apache/parquet-cpp
[3] https://github.com/apache/arrow/tree/master/cpp
dgudkov · over 3 years ago

DuckDB is faster than SQLite on an analytical type of query. That's expected, because DuckDB is column-based and designed exactly for this type of query.

The analysis would be more telling if it measured insert performance as well.
cosmaioan · over 3 years ago

I would be curious to see DuckDB with Parquet: https://duckdb.org/docs/data/parquet
ryndbfsrw · over 3 years ago

I am certain I'm in the wrong here, but I'm struggling to understand Arrow's USP. I (originally) assumed it meant Python/R users would be able to get around memory limitations when model-fitting, but all the examples I've come across are just data manipulation, and none of the main modeling packages support it. Those who are using it: what am I missing?
einpoklum · over 3 years ago

Note that:

* The comparison used a single, simple query: filter, group by, and sum.

* The comparison does not include repeated multiple queries (similar or dissimilar).

* A moderate amount of data is used.

This is certainly relevant and important, but it's not an extensive comparison.
homerowilson · over 3 years ago

I enjoyed this comparison, thanks! Here is a related, generally R-centric comparison of DuckDB, dplyr, data.table, etc. applied to five data-sciency problems, which I wrote up a few months ago: https://github.com/bwlewis/duckdb_and_r
HighChaparral · over 3 years ago
Possibly nothing more than a side issue, but why are the indexes being created before the bulk insert, rather than after?
knome · over 3 years ago

> dbSendQuery(con, "CREATE INDEX year ON yrpc (year)")
> dbSendQuery(con, "CREATE INDEX reporter_iso ON yrpc (reporter_iso)")

Would SQLite use these for the query given?

It seems they wouldn't help with grouping, and the grouping seems antagonistic to filtering. If you're trying for performance, you can avoid having the query engine refer back to the original table during processing by using a covering index that includes the data fields in addition to the query fields:

"CREATE INDEX covering (reporter_iso, year, trade_value_usd_exp, trade_value_usd_imp)"

or maybe even something like

"CREATE INDEX covering (reporter_iso, year, reporter_iso, trade_value_usd_exp, trade_value_usd_imp)"

though I'm not sure how its query planner would do with a duplicate column in the index.
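Whether SQLite actually answers a query from the index alone can be checked with EXPLAIN QUERY PLAN, which labels such access as a "COVERING INDEX". A sketch using Python's stdlib sqlite3, with the column names from the article's `yrpc` table:

```python
# Create the covering index suggested above and ask SQLite's planner
# how it would run the article's aggregate query.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE yrpc (year INTEGER, reporter_iso TEXT,
                       trade_value_usd_exp INTEGER,
                       trade_value_usd_imp INTEGER)
""")
con.execute("""
    CREATE INDEX covering ON yrpc
        (reporter_iso, year, trade_value_usd_exp, trade_value_usd_imp)
""")
plan = con.execute("""
    EXPLAIN QUERY PLAN
    SELECT year, SUM(trade_value_usd_exp)
    FROM yrpc WHERE reporter_iso = 'usa' GROUP BY year
""").fetchall()
# The detail column of the plan should read something like:
# "SEARCH yrpc USING COVERING INDEX covering (reporter_iso=?)"
print(plan[0][3])
```

If the plan says COVERING INDEX, the base table is never touched — which is the saving the comment describes.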
CRConrad · over 3 years ago

Here's another HN page on an article comparing SQLite to other DBs, from just the other day :-) https://news.ycombinator.com/item?id=7432619
mjburgess · over 3 years ago

Does the R code for, e.g., SQLite actually build an SQL query?

This just looks like a tidyverse library comparison. I'd expect a benchmark using their own libs, rather than assuming tidyverse will have an optimal way of querying them.
marcle · over 3 years ago

TL;DR: Arrow and DuckDB provide fast database aggregates compared with R's RDS format and, to an extent, SQLite.

It is unclear how much functionality is available for Arrow under R: any comments? It would also be interesting to see a similar benchmark for Python, which could include the embedded version of MonetDB -- an R package for MonetDB/e is not yet available.

Edit: amended the TL;DR to reflect jhoechtl's and wodenokoto's comments. SQLite provided reasonably memory-efficient aggregates.
flakiness · over 3 years ago

Although this article focuses on dplyr, which provides a uniform API over various data sources, I guess the choice depends more on personal taste: do you like staying in R / Python, or in SQL?

If you prefer SQL, using raw Arrow or in-memory data doesn't make sense -- except that DuckDB does support SQL queries over these non-DB data sources as well.

Anyway, the R data ecosystem is very rich for both R and SQL fans, and as a Python user I feel a bit envious.
vertere · over 3 years ago

I'm a bit surprised the DuckDB file size is that large (5x Parquet, and nearly as large as SQLite). I haven't used DuckDB, but I thought it had columnar storage and so would have better compression.
xiaodai · over 3 years ago
I think disk.frame should fare pretty well if you use the `srckeep` function.
Fiahil · over 3 years ago

So DuckDB is a kind of DataFusion (https://arrow.apache.org/datafusion/)?
WFHRenaissance · over 3 years ago

Why aren't indexes being created on the RDS instance? I know it's out of scope based on the title, but it's also used for comparison in the final results table.
m0zg · over 3 years ago

Might as well try Arrow Feather; there's an R binding for it. This Flatbuffers-based format can be memory-mapped, and it requires very little decoding. Super fast and light on resources.

Also, Parquet supports different encodings, some of which may be faster, denser, and/or consume less RAM — or all of the above.