Arrow is *the most important* thing happening in the data ecosystem right now. It's going to allow you to run your choice of execution engine on top of your choice of data store, as though they were designed to work together. It will mostly be invisible to users; the key thing that needs to happen is that all the producers and consumers of batch data adopt Arrow as the common interchange format.

BigQuery recently implemented the Storage API, which allows you to read BQ tables, in parallel, in Arrow format: https://cloud.google.com/bigquery/docs/reference/storage

Snowflake has adopted Arrow as the in-memory format for their JDBC driver, though to my knowledge there is still no way to access data in *parallel* from Snowflake, other than to export to S3.

As Arrow spreads across the ecosystem, users are going to start discovering that they can store data in one system and query it in another, at full speed, and it's going to be amazing.
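A minimal sketch of that interchange pattern from Python, assuming the google-cloud-bigquery and pyarrow packages; the project, dataset, and table names are made up, and the exact client call surface may vary by version:

```python
# Sketch: read a BigQuery table into Arrow, then hand the same in-memory
# columns to another consumer (here, pandas) without re-serializing.
from google.cloud import bigquery

client = bigquery.Client()
query = "SELECT * FROM `my-project.my_dataset.events` LIMIT 100000"

# RowIterator.to_arrow() returns a pyarrow.Table; with the Storage API
# available it downloads the result as Arrow record batches rather than
# row-by-row JSON.
arrow_table = client.query(query).result().to_arrow()

print(arrow_table.schema)
df = arrow_table.to_pandas()  # cheap conversion; zero-copy for many column types
```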
Excited to see this release's official inclusion of the pure Julia Arrow implementation [1]!

It's so cool to be able to mmap Arrow memory and natively manipulate it from within Julia with virtually no performance overhead. Since the Julia compiler can specialize on the layout of Arrow-backed types at runtime (just as it can with any other type), the notion of needing to build/work with a separate "compiler for fast UDFs" is rendered obsolete.

It feels pretty magical when two tools like this compose so well without either being designed with the other in mind - a testament to the thoughtful design of both :) mad props to Jacob Quinn for spearheading the effort to revive/restart Arrow.jl and get the package into this release.

[1] https://github.com/JuliaData/Arrow.jl
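The same memory-mapping trick is available outside Julia too; here is a rough pyarrow analogue (the file name is made up), writing a table to the Arrow IPC file format and then mapping it back without copying:

```python
import pyarrow as pa
import pyarrow.ipc as ipc

# Write a small table to the Arrow IPC file format (the same layout Feather v2 uses).
table = pa.table({
    "id": list(range(100_000)),
    "value": [float(i) for i in range(100_000)],
})
with pa.OSFile("example.arrow", "wb") as sink:
    with ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table)

# Memory-map the file and read it back; the column buffers reference the
# mapped pages directly, so nothing is copied into the process heap up front.
with pa.memory_map("example.arrow") as source:
    mapped = ipc.open_file(source).read_all()

print(mapped.num_rows, mapped.column("value")[0])
```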
If curious see also:

2020 https://news.ycombinator.com/item?id=23965209

2018 (a bit) https://news.ycombinator.com/item?id=17383881

2017 https://news.ycombinator.com/item?id=15335462

2017 https://news.ycombinator.com/item?id=15594542 (rediscussed recently: https://news.ycombinator.com/item?id=25258626)

2016 https://news.ycombinator.com/item?id=11118274

Also related, from a couple weeks ago: https://news.ycombinator.com/item?id=25824399

Related, from a few months ago: https://news.ycombinator.com/item?id=24534274

Related, from 2019: https://news.ycombinator.com/item?id=21826974
This link is a 404. Perhaps they weren't intending this post to be public yet?

At any rate, archive.org managed to grab it: https://web.archive.org/web/20210203194945/https://arrow.apache.org/blog/2021/01/25/3.0.0-release/
For use as a file format, where one priority is to compress columnar data as well as possible, the practical difference between Arrow (via Feather?), Parquet, and ORC is still somewhat vague to me. During my last investigation, I got the impression that Arrow worked great as a standard, interoperable, in-memory columnar format, but didn't compress nearly as well as ORC or Parquet due to its lack of RLE and other encodings (other than dictionary). Is this still the case? Is there a world where Arrow completely supplants Parquet and/or ORC?

EDIT: Just found https://wesmckinney.com/blog/arrow-columnar-abadi, which helps answer this question.
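One quick way to see the on-disk difference for your own data is to write the same table in both formats and compare sizes; a rough pyarrow sketch (file names are made up, and the result obviously depends on the data):

```python
import os
import pyarrow as pa
import pyarrow.feather as feather
import pyarrow.parquet as pq

# A deliberately repetitive column, where Parquet's dictionary/RLE encodings shine.
n = 300_000
table = pa.table({
    "category": ["a", "b", "c"] * (n // 3),
    "value": [i % 100 for i in range(n)],
})

feather.write_feather(table, "data.feather", compression="zstd")  # Arrow IPC + block compression
pq.write_table(table, "data.parquet", compression="zstd")         # Parquet encodings + compression

for path in ("data.feather", "data.parquet"):
    print(path, os.path.getsize(path), "bytes")
```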
Can someone dig into the pros and cons of the columnar aspect of Arrow? There are many other data transfer formats, but this one seems to promote its columnar orientation.

Formats like, e.g., Protocol Buffers support hierarchical data, which seems like a superset of columns. Is there a benefit to a column-based format? Is it an enforced simplification to ensure greater compatibility, or is there some other reason?
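One concrete benefit of the columnar layout: an analytical query that touches only a couple of columns never has to read or decode the others, and each column's values sit contiguously in memory, which makes vectorized kernels cheap. A small pyarrow illustration of the same records viewed row-wise and column-wise:

```python
import pyarrow as pa
import pyarrow.compute as pc

# Row-oriented view: every record carries every field.
rows = [
    {"user": "alice", "country": "DE", "amount": 12.5, "notes": "..."},
    {"user": "bob",   "country": "US", "amount": 7.0,  "notes": "..."},
    {"user": "carol", "country": "DE", "amount": 3.25, "notes": "..."},
]

# Columnar view: each field becomes one contiguous array.
table = pa.Table.from_pylist(rows)

# "Total amount for DE" only scans the 'country' and 'amount' buffers;
# 'user' and 'notes' are never touched.
mask = pc.equal(table["country"], "DE")
total = pc.sum(table.filter(mask)["amount"])
print(total.as_py())  # 15.75
```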
So if I understand this correctly from an application developer's perspective:

- For OLTP tasks, something row-based like SQLite is great: small to medium amounts of data, mixed reading/writing, with transactions.

- For OLAP tasks, Arrow looks great: big amounts of data, faster querying (DataFusion), and more compact data files with Parquet.

Basically, prevent the operational database from growing too large and offload older data to Arrow/Parquet. Did I get this correct?

Additionally there seem to be further benefits, like sharing Arrow/Parquet data with other consumers.

Sounds convincing, I just have two very specific questions:

- If I load a ~2GB collection of items into Arrow and query it with DataFusion, how much slower will this perform in comparison to my current Rust code that holds a large Vec in memory and "queries" via iter/filter?

- If I want to move data from SQLite to a more permanent Parquet "archive" file, is there a better way than recreating the whole file or writing additional files, like appending?

Really curious; I could find no hints online so far to get an idea.
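On the second question: since Parquet files are immutable, the usual pattern is to write each offload batch as a new file in an archive directory and let readers treat the directory as one logical dataset, rather than appending to an existing file. A rough pyarrow sketch, with a hypothetical schema and table names:

```python
import sqlite3
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

# Pull the rows to be archived out of the operational SQLite database.
# The table and column names here are made up.
conn = sqlite3.connect("app.db")
rows = conn.execute(
    "SELECT id, created_at, payload FROM events WHERE created_at < ?",
    ("2021-01-01",),
).fetchall()

ids, created, payloads = zip(*rows) if rows else ((), (), ())
archived = pa.table({
    "id": pa.array(list(ids), type=pa.int64()),
    "created_at": pa.array(list(created), type=pa.string()),
    "payload": pa.array(list(payloads), type=pa.string()),
})

# "Appending" to the archive means adding another file to the directory,
# not rewriting an existing Parquet file.
pq.write_to_dataset(archived, root_path="archive/events")

# Readers can then query the whole directory as a single dataset.
dataset = ds.dataset("archive/events", format="parquet")
print(dataset.to_table().num_rows)
```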
Related - the recent Apache Arrow support in Julia announcement: https://julialang.org/blog/2021/01/arrow/
I got interested in Arrow recently after reading this blog post showing that Arrow (and Ray) are much faster than Pickle: https://rise.cs.berkeley.edu/blog/fast-python-serialization-ray-apache-arrow/

I have a question about whether it would fit this use case:

* I need a SUPER fast KV-store.

* I'm on a single machine.

* Keys are 10 bytes if you compress (or strings of 32 characters if you don't); unfortunately I can't store them as an 8-byte int. SQLite said it supports arbitrary-precision numerics, but then I got burned finding out that it casts integers to arbitrary-precision floats and only keeps the first 14 digits of precision :\

* Values are 4-byte ints. Maybe 3 4-byte ints.

* I have maybe 10B - 100B rows.

* I need super fast lookup, and depending on my machine I can't always cache this in memory, so I might need to work from disk.

Would Arrow be useful for this? Currently just using SQLite.
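For context on the serialization claim in that post: the Arrow IPC format turns a table into a buffer and back with essentially no per-value decoding, which is where most of the speedup over pickle comes from. A small hedged sketch of that round trip in pyarrow:

```python
import pyarrow as pa
import pyarrow.ipc as ipc

table = pa.table({"key": [b"0123456789"] * 1000, "value": list(range(1000))})

# Serialize: write the table into an in-memory buffer in Arrow IPC stream format.
sink = pa.BufferOutputStream()
with ipc.new_stream(sink, table.schema) as writer:
    writer.write_table(table)
buf = sink.getvalue()

# Deserialize: reading back mostly re-points column buffers at the existing
# memory, rather than decoding values one by one as pickle does.
roundtripped = ipc.open_stream(buf).read_all()
print(roundtripped.num_rows, roundtripped.column("value")[0])
```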
Would really love to see first-class support for JavaScript/TypeScript for data visualization purposes. The columnar format would naturally lend itself to an Entity-Component style architecture with TypedArrays.