Demystifying Apache Arrow (2020)

197 points by dmlorenzetti over 2 years ago

13 comments

RobinL over 2 years ago

Author here. Since I wrote this, Arrow seems to be more and more pervasive. As a data engineer, the adoption of Arrow (and Parquet) as a data exchange format has so much value. It's amazing how much time my colleagues and I have spent on data type issues that have arisen from the wide range of data tooling (R, pandas, Excel, etc.). So much so that I try to stick to Parquet, using SQL where possible to easily preserve data types (pandas is a particularly bad offender for managing data types).

In doing so, I'm implicitly using Arrow - e.g. with DuckDB, AWS Athena and so on. The list of tools using Arrow is long! <a href="https://arrow.apache.org/powered_by/" rel="nofollow">https://arrow.apache.org/powered_by/</a>

Another interesting development since I wrote this is DuckDB.

DuckDB offers a compute engine with great performance against Parquet files and other formats. Probably similar performance to Arrow. It's interesting they opted to write their own compute engine rather than use Arrow's - but I believe this is partly because Arrow was immature when they were starting out. I mention it because, as far as I know, there's not yet an easy SQL interface to Arrow from Python.

Nonetheless, DuckDB is still using Arrow for some of its other features: <a href="https://duckdb.org/2021/12/03/duck-arrow.html" rel="nofollow">https://duckdb.org/2021/12/03/duck-arrow.html</a>

Arrow also has a SQL query engine: <a href="https://arrow.apache.org/blog/2019/02/04/datafusion-donation/" rel="nofollow">https://arrow.apache.org/blog/2019/02/04/datafusion-donation...</a>

I might be wrong about this - but in my experience, it feels like there's more consensus around the Arrow format than around the compute side.

Going forward, I see Parquet continuing on its path to becoming a de facto standard for storing and sharing bulk data. I'm particularly excited about new tools that allow you to process it in the browser. I've written more about this just yesterday: <a href="https://www.robinlinacre.com/parquet_api/" rel="nofollow">https://www.robinlinacre.com/parquet_api/</a>, discussion: <a href="https://news.ycombinator.com/item?id=34310695" rel="nofollow">https://news.ycombinator.com/item?id=34310695</a>.
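
A minimal sketch of the data type problem mentioned above, using only the Python standard library (so this is not pandas, Arrow, or Parquet code). It shows how a CSV round trip discards type information, which is the kind of issue a schema-carrying format is meant to avoid. The column names and values are made up for illustration.

```python
import csv
import io
from datetime import date

# Hypothetical records with an int, a float, a date, and a missing value.
rows = [
    {"id": 1, "price": 9.5, "sold_on": date(2023, 1, 9), "note": None},
    {"id": 2, "price": 12.0, "sold_on": date(2023, 1, 10), "note": "gift"},
]

# Write the records to an in-memory CSV file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
writer.writeheader()
writer.writerows(rows)

# Read them back: CSV carries no schema, so every value comes back as a string.
buffer.seek(0)
round_tripped = list(csv.DictReader(buffer))

original_types = {k: type(v).__name__ for k, v in rows[0].items()}
csv_types = {k: type(v).__name__ for k, v in round_tripped[0].items()}
print("before:", original_types)
print("after: ", csv_types)
```

Note that the missing value also comes back as an empty string, so the reader can no longer tell "missing" from "empty text" without extra conventions.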


kajika91 over 2 years ago

I prefer looking at benchmarks: <a href="https://towardsdatascience.com/the-best-format-to-save-pandas-data-414dca023e0d" rel="nofollow">https://towardsdatascience.com/the-best-format-to-save-panda...</a>

I have used Arrow and even made my humble contribution to the Go binding, but I don't like pretending it is so much better than other solutions. It is not a silver bullet, and probably its biggest advantage is the "non-copy" goal of converting data into different frameworks' objects. Depending on how the data is used, a columnar layout can be better, but not always.
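
For readers unfamiliar with the two ideas in this comment, here is a rough standard-library analogy in Python. It is not Arrow's API: `array` stands in for a typed column buffer and `memoryview` stands in for sharing that buffer without copying. The data is made up.

```python
from array import array

# Row-oriented layout: one tuple per record (hypothetical data).
rows = [(i, float(i) * 0.5) for i in range(1000)]

# Column-oriented layout: one contiguous typed buffer per column.
ids = array("q", (r[0] for r in rows))      # signed 64-bit integers
prices = array("d", (r[1] for r in rows))   # 64-bit floats

# Aggregating one column only needs that column's buffer in the columnar
# layout, whereas the row layout walks every record.
total_row_layout = sum(r[1] for r in rows)
total_col_layout = sum(prices)

# Sharing without copying: a memoryview exposes the same bytes to another
# consumer instead of duplicating them.
view = memoryview(prices)
window = view[10:20]   # slicing a memoryview does not copy the data
prices[10] = -1.0      # a write through the array...
shared = window[0]     # ...is visible through the view
print(total_row_layout, total_col_layout, shared)
```

The flip side, as the comment says, is that a columnar layout is not always better: reading or updating a whole record means touching every column buffer.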

Lyngbakr over 2 years ago

This has been a game changer for us. When our analysts run queries on Parquet files using Arrow, they are orders of magnitude faster than equivalent SQL queries on databases.


alamb over 2 years ago

Here is another blog post that offers some perspective on the growth of Arrow over the intervening years and future directions: <a href="https://www.datawill.io/posts/apache-arrow-2022-reflection/" rel="nofollow">https://www.datawill.io/posts/apache-arrow-2022-reflection/</a>


agumonkey over 2 years ago

Very interesting project.

PS: a tiny video to explain storage layout optimizations: <a href="https://yewtu.be/watch?v=dPb2ZXnt2_U" rel="nofollow">https://yewtu.be/watch?v=dPb2ZXnt2_U</a>

gizmodo59 over 2 years ago

There are a bunch of other projects that grew out of Arrow which are also contributing a lot to data engineering: <a href="https://www.dremio.com/blog/apache-arrows-rapid-growth-over-the-years/" rel="nofollow">https://www.dremio.com/blog/apache-arrows-rapid-growth-over-...</a>

flakiness over 2 years ago

FYI: A recent "Data Analysis Podcast" interviews Arrow's founder, Wes McKinney, on this topic. <a href="https://roundup.getdbt.com/p/ep-37-what-does-apache-arrow-unlock" rel="nofollow">https://roundup.getdbt.com/p/ep-37-what-does-apache-arrow-un...</a>

hermitcrab over 2 years ago

I have written a desktop data wrangling/ETL tool for Windows/Mac (Easy Data Transform). It is designed to handle millions of rows (but not billions). Currently it mostly inputs and outputs CSV, Excel, XML and JSON. I am looking to add some additional formats in future, such as SQLite, Parquet or DuckDB. Maybe I need to look at Feather as well? I could also use one of these formats to store intermediate datasets to disk, rather than holding everything in memory. If anyone has any experience in integrating any of these formats into a C++ application on Windows and/or Mac, I would be interested to hear how you got on and what libraries you used.

d_burfoot over 2 years ago

Can someone comment on the code quality of Arrow vs other Apache data engineering tools?

I have been burned so many times by amateur-hour software engineering failures from the Apache world that it's very hard for me to ever willingly adopt anything from that brand again. Just put it in gzipped JSON or TSV and hey, if there's a performance penalty, it's better to pay a bit more for cloud compute than hate your job because of some nonsense dependency issue caused by an org.apache library failing to follow proper versioning guidelines.
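
For reference, the dependency-free route suggested here might look roughly like the following in Python, using only the standard library. This is a sketch: the record fields, the file name, and the choice of one JSON object per line are assumptions made for illustration.

```python
import gzip
import json
import os
import tempfile

# Hypothetical records to exchange.
records = [{"id": i, "label": f"item-{i}", "score": i / 10} for i in range(5)]

# Hypothetical output location in a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "records.jsonl.gz")

# Write one JSON object per line, compressed with gzip.
with gzip.open(path, "wt", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Read it back, decoding one line at a time.
with gzip.open(path, "rt", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]

print(loaded == records)
```

The trade-off is the one named in the comment: no extra dependencies, at the cost of text parsing and of JSON's limited set of types.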


rr888 over 2 years ago

I always thought the file format was going to be tightly bound to Arrow, but it looks like they aren't encouraging Feather. Should we just be using Parquet for file storage?


kordlessagain over 2 years ago

FeatureBase uses Arrow for processing data stored in bitmap format: <a href="https://featurebase.com/" rel="nofollow">https://featurebase.com/</a>

mjburgess over 2 years ago

> Learning more about a tool that can filter and aggregate two billion rows on a laptop in two seconds

If someone has a code example to this effect, I'd be grateful.

I was once engaged in a salesy pitch by a cloud advocate that BigQuery (et al.) can "process a billion rows a second".

I tried to create an SQLite example with a billion rows to show that this isn't impressive, but I gave up after some obstacles to generating the data.

It would be nice to have an example like this to show developers (and engineers) who have become accustomed to the extreme levels of CPU abuse today, to show that modern laptops really are supercomputers.

It should be obvious that a laptop can rival a data centre at 90% of ordinary tasks; that it isn't, in my view, has a lot to do with the state of OS/browser/app/etc. design & performance. Supercomputers, alas, dedicated to drawing pixels by way of a dozen layers of indirection.
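
On the obstacle of generating the data: one possible approach is to let SQLite generate the rows itself with a recursive CTE. The sketch below uses Python's standard `sqlite3` module and is scaled down to one million rows; the table name, column names, and value formulas are hypothetical, and it makes no claim about how SQLite compares with Arrow or BigQuery at a billion rows.

```python
import sqlite3
import time

N_ROWS = 1_000_000  # assumption: scaled down; raise it to taste

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (id INTEGER PRIMARY KEY, category INTEGER, amount REAL)"
)

# Generate synthetic rows inside SQLite with a recursive CTE, so no data
# has to be built in Python first.
start = time.perf_counter()
conn.execute(
    """
    WITH RECURSIVE seq(n) AS (
        SELECT 1
        UNION ALL
        SELECT n + 1 FROM seq WHERE n < ?
    )
    INSERT INTO events (id, category, amount)
    SELECT n, n % 10, (n % 1000) / 10.0 FROM seq
    """,
    (N_ROWS,),
)
conn.commit()
generate_seconds = time.perf_counter() - start

# Filter and aggregate: count and sum the larger amounts per category.
start = time.perf_counter()
result = conn.execute(
    """
    SELECT category, COUNT(*), SUM(amount)
    FROM events
    WHERE amount >= 50.0
    GROUP BY category
    ORDER BY category
    """
).fetchall()
query_seconds = time.perf_counter() - start

print(f"generated {N_ROWS:,} rows in {generate_seconds:.2f}s")
print(f"filtered and aggregated in {query_seconds:.2f}s")
```

Timings will vary by machine, so none are quoted here. SQLite stores data row by row, so a columnar engine would be expected to behave differently on this kind of query.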


amayui over 2 years ago

We've been thinking about using Parquet as a serialization format during data exchange in our project as well.