Arrow is *the most important* thing happening in the data ecosystem right now. It's going to allow you to run your choice of execution engine on top of your choice of data store, as though they were designed to work together. It will mostly be invisible to users; the key thing that needs to happen is that all the producers and consumers of batch data adopt Arrow as the common interchange format.

BigQuery recently implemented the Storage API, which allows you to read BQ tables, in parallel, in Arrow format: https://cloud.google.com/bigquery/docs/reference/storage

Snowflake has adopted Arrow as the in-memory format for their JDBC driver, though to my knowledge there is still no way to access data in *parallel* from Snowflake, other than to export to S3.

As Arrow spreads across the ecosystem, users are going to start discovering that they can store data in one system and query it in another, at full speed, and it's going to be amazing.
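A minimal sketch of that interchange pattern from Python, assuming the google-cloud-bigquery and pyarrow packages; the project, dataset, and table names are made up, and the exact client call surface may vary by version:

```python
# Sketch: read a BigQuery table into Arrow, then hand the same in-memory
# columns to another consumer (here, pandas) without re-serializing.
from google.cloud import bigquery

client = bigquery.Client()
query = "SELECT * FROM `my-project.my_dataset.events` LIMIT 100000"

# RowIterator.to_arrow() returns a pyarrow.Table; with the Storage API
# available it downloads the result as Arrow record batches rather than
# row-by-row JSON.
arrow_table = client.query(query).result().to_arrow()

print(arrow_table.schema)
df = arrow_table.to_pandas()  # cheap conversion; zero-copy for many column types
```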
Excited to see this release's official inclusion of the pure Julia Arrow implementation [1]!

It's so cool to be able to mmap Arrow memory and natively manipulate it from within Julia with virtually no performance overhead. Since the Julia compiler can specialize on the layout of Arrow-backed types at runtime (just as it can with any other type), the notion of needing to build/work with a separate "compiler for fast UDFs" is rendered obsolete.

It feels pretty magical when two tools like this compose so well without either being designed with the other in mind - a testament to the thoughtful design of both :) mad props to Jacob Quinn for spearheading the effort to revive/restart Arrow.jl and get the package into this release.

[1] https://github.com/JuliaData/Arrow.jl
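The same memory-mapping trick is available outside Julia too; here is a rough pyarrow analogue (the file name is made up), writing a table to the Arrow IPC file format and then mapping it back without copying:

```python
import pyarrow as pa
import pyarrow.ipc as ipc

# Write a small table to the Arrow IPC file format (the same layout Feather v2 uses).
table = pa.table({
    "id": list(range(100_000)),
    "value": [float(i) for i in range(100_000)],
})
with pa.OSFile("example.arrow", "wb") as sink:
    with ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table)

# Memory-map the file and read it back; the column buffers reference the
# mapped pages directly, so nothing is copied into the process heap up front.
with pa.memory_map("example.arrow") as source:
    mapped = ipc.open_file(source).read_all()

print(mapped.num_rows, mapped.column("value")[0])
```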
If curious see also:

2020 https://news.ycombinator.com/item?id=23965209

2018 (a bit) https://news.ycombinator.com/item?id=17383881

2017 https://news.ycombinator.com/item?id=15335462

2017 https://news.ycombinator.com/item?id=15594542 (rediscussed recently: https://news.ycombinator.com/item?id=25258626)

2016 https://news.ycombinator.com/item?id=11118274

Also related, from a couple weeks ago: https://news.ycombinator.com/item?id=25824399

Related, from a few months ago: https://news.ycombinator.com/item?id=24534274

Related, from 2019: https://news.ycombinator.com/item?id=21826974
This link is a 404. Perhaps they weren't intending this post to be public yet?

At any rate, archive.org managed to grab it: https://web.archive.org/web/20210203194945/https://arrow.apache.org/blog/2021/01/25/3.0.0-release/
For use as a file format, where one priority is to compress columnar data as well as possible, the practical difference between Arrow (via Feather?), Parquet, and ORC is still somewhat vague to me. During my last investigation, I got the impression that Arrow worked great as a standard, interoperable, in-memory columnar format, but didn't compress nearly as well as ORC or Parquet due to its lack of RLE and other encodings (other than dictionary). Is this still the case? Is there a world where Arrow completely supplants Parquet and/or ORC?

EDIT: Just found https://wesmckinney.com/blog/arrow-columnar-abadi, which helps answer this question.
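One quick way to see the on-disk difference for your own data is to write the same table in both formats and compare sizes; a rough pyarrow sketch (file names are made up, and the result obviously depends on the data):

```python
import os
import pyarrow as pa
import pyarrow.feather as feather
import pyarrow.parquet as pq

# A deliberately repetitive column, where Parquet's dictionary/RLE encodings shine.
n = 300_000
table = pa.table({
    "category": ["a", "b", "c"] * (n // 3),
    "value": [i % 100 for i in range(n)],
})

feather.write_feather(table, "data.feather", compression="zstd")  # Arrow IPC + block compression
pq.write_table(table, "data.parquet", compression="zstd")         # Parquet encodings + compression

for path in ("data.feather", "data.parquet"):
    print(path, os.path.getsize(path), "bytes")
```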
Can someone dig into the pros and cons of the columnar aspect of Arrow? There are many other data transfer formats, but this one seems to promote its columnar orientation.

Formats like, e.g., Protocol Buffers support hierarchical data, which seems like a superset of columns. Is there a benefit to a column-based format? Is it an enforced simplification to ensure greater compatibility, or is there some other reason?
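One concrete benefit of the columnar layout: an analytical query that touches only a couple of columns never has to read or decode the others, and each column's values sit contiguously in memory, which makes vectorized kernels cheap. A small pyarrow illustration of the same records viewed row-wise and column-wise:

```python
import pyarrow as pa
import pyarrow.compute as pc

# Row-oriented view: every record carries every field.
rows = [
    {"user": "alice", "country": "DE", "amount": 12.5, "notes": "..."},
    {"user": "bob",   "country": "US", "amount": 7.0,  "notes": "..."},
    {"user": "carol", "country": "DE", "amount": 3.25, "notes": "..."},
]

# Columnar view: each field becomes one contiguous array.
table = pa.Table.from_pylist(rows)

# "Total amount for DE" only scans the 'country' and 'amount' buffers;
# 'user' and 'notes' are never touched.
mask = pc.equal(table["country"], "DE")
total = pc.sum(table.filter(mask)["amount"])
print(total.as_py())  # 15.75
```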
So if I understand this correctly from an application developer's perspective:

- For OLTP tasks, something row-based like SQLite is great: small to medium amounts of data, mixed reading/writing, with transactions.

- For OLAP tasks, Arrow looks great: big amounts of data, faster querying (DataFusion), and more compact data files with Parquet.

Basically, prevent the operational database from growing too large and offload older data to Arrow/Parquet. Did I get this correct?

Additionally there seem to be further benefits, like sharing Arrow/Parquet data with other consumers.

Sounds convincing, I just have two very specific questions:

- If I load a ~2GB collection of items into Arrow and query it with DataFusion, how much slower will this perform in comparison to my current Rust code that holds a large Vec in memory and "queries" via iter/filter?

- If I want to move data from SQLite to a more permanent Parquet "archive" file, is there a better way than recreating the whole file or writing additional files, like appending?

Really curious; I could find no hints online so far to get an idea.
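On the second question: since Parquet files are immutable, the usual pattern is to write each offload batch as a new file in an archive directory and let readers treat the directory as one logical dataset, rather than appending to an existing file. A rough pyarrow sketch, with a hypothetical schema and table names:

```python
import sqlite3
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

# Pull the rows to be archived out of the operational SQLite database.
# The table and column names here are made up.
conn = sqlite3.connect("app.db")
rows = conn.execute(
    "SELECT id, created_at, payload FROM events WHERE created_at < ?",
    ("2021-01-01",),
).fetchall()

ids, created, payloads = zip(*rows) if rows else ((), (), ())
archived = pa.table({
    "id": pa.array(list(ids), type=pa.int64()),
    "created_at": pa.array(list(created), type=pa.string()),
    "payload": pa.array(list(payloads), type=pa.string()),
})

# "Appending" to the archive means adding another file to the directory,
# not rewriting an existing Parquet file.
pq.write_to_dataset(archived, root_path="archive/events")

# Readers can then query the whole directory as a single dataset.
dataset = ds.dataset("archive/events", format="parquet")
print(dataset.to_table().num_rows)
```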
Related - the recent Apache Arrow support in Julia announcement: https://julialang.org/blog/2021/01/arrow/
I got interested in Arrow recently after reading this blog post showing that Arrow (and Ray) are much faster than Pickle: https://rise.cs.berkeley.edu/blog/fast-python-serialization-ray-apache-arrow/

I have a question about whether it would fit this use case:

* I need a SUPER fast KV-store.

* I'm on a single machine.

* Keys are 10 bytes if you compress (or strings of 32 characters if you don't); unfortunately I can't store them as an 8-byte int. SQLite said it supports arbitrary-precision numerics, but then I got burned finding out that it casts integers to arbitrary-precision floats and only keeps the first 14 digits of precision :\

* Values are 4-byte ints. Maybe 3 4-byte ints.

* I have maybe 10B - 100B rows.

* I need super fast lookup, and depending on my machine I can't always cache this in memory, so I might need to work from disk.

Would Arrow be useful for this? Currently just using SQLite.
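For context on the serialization claim in that post: the Arrow IPC format turns a table into a buffer and back with essentially no per-value decoding, which is where most of the speedup over pickle comes from. A small hedged sketch of that round trip in pyarrow:

```python
import pyarrow as pa
import pyarrow.ipc as ipc

table = pa.table({"key": [b"0123456789"] * 1000, "value": list(range(1000))})

# Serialize: write the table into an in-memory buffer in Arrow IPC stream format.
sink = pa.BufferOutputStream()
with ipc.new_stream(sink, table.schema) as writer:
    writer.write_table(table)
buf = sink.getvalue()

# Deserialize: reading back mostly re-points column buffers at the existing
# memory, rather than decoding values one by one as pickle does.
roundtripped = ipc.open_stream(buf).read_all()
print(roundtripped.num_rows, roundtripped.column("value")[0])
```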
Would really love to see first-class support for JavaScript/TypeScript for data visualization purposes. The columnar format would naturally lend itself to an Entity-Component style architecture with TypedArrays.