If you don't speak press-release: this is a cool project to create an in-memory interop format for columnar data, so various tools can share data without a serialize/deserialize step, which is very expensive compared to copying or sharing memory on the same machine or within a local network.

https://git-wip-us.apache.org/repos/asf?p=arrow.git;a=blob;f=format/Layout.md;h=c393163bf894bab283641882d9aa4a8c2ef0ef8e;hb=HEAD

(Edited because I misread the git repo and didn't notice the Java implementation.)
Nice initiative. Cheap serde and cross-language compatibility, with an eye toward scan-intensive workloads, is an important component!

Have you folks considered the Supersonic engine from Google, which was designed with similar (though not as extensive as Arrow's) goals in mind?

https://github.com/google/supersonic
In-memory only... How is this better than the SFrame implementation from Dato (2015) that was posted here a couple of days ago?

https://news.ycombinator.com/item?id=11106501
I had a hard time parsing out the technical bits behind the facade of titles and endorsements. I didn't know you could be a Vice President of an open source product. Two roles I'm familiar with: committer and PMC chair. Now VPs. Just saying...
Not to completely denigrate Apache, but I have found contributing to Apache projects (not just projects that use the license, but real Apache projects) to be somewhat of a hassle. They generally do not take PRs but rather patches sent by email or attached to bug reports (perhaps some do, but I have yet to see one that does), they require signed contributor agreements, and JIRA is really getting slow these days.

I'm somewhat ignorant since I don't run any Apache projects, but I'm curious why people choose Apache to back their projects these days. Why choose a committee instead of just leaving it on GitHub? I suppose it's the whole voting and board structure.
Nice to see a new columnar data format alternative. Just a quick question, though.

Existing columnar formats such as Parquet and ORC aim to be space-efficient, since the data lives on disk and IO is usually the bottleneck. Columnar formats shine in big-data settings where the volume is huge, and efficient compression is one of their main selling points. Given that, I'm not sure I understand the point of an in-memory columnar format.

Once the data is in memory and we can access any column of a row in constant time, what's the difference between a row-oriented format and a columnar one?
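To make the question concrete, here is a sketch in plain Java of the two layouts I mean (not Arrow's actual API; RowRecord and ColumnarTable are made-up names for illustration):

    // Row-oriented: each record is an object, so scanning one column chases a
    // pointer per row and drags unrelated fields through the cache.
    class RowRecord {
        long id;
        double price;
        String name;
    }

    // Columnar: each column is one contiguous primitive array, so scanning
    // "price" is a sequential pass over a single array.
    class ColumnarTable {
        long[] id;
        double[] price;
        String[] name;

        double sumPrices() {
            double sum = 0;
            for (int i = 0; i < price.length; i++) {
                sum += price[i];   // sequential, cache-friendly access
            }
            return sum;
        }
    }

Presumably the argument for in-memory columnar is about access patterns like this rather than compression, but that's exactly what I'd like confirmed.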
I'm interpreting this as saying they want to use the same representation in memory (for querying) and for 'serialization' (sending the same thing over the wire). That raises the question of why separate serialized representations ever became a thing in the first place.

My understanding is that serialization became a thing because in-memory representations tend to use pointers to shared data structures that may be referenced multiple times while being stored only once. That does not translate 1:1 to a serialized representation (where memory addresses no longer hold meaning), much less in any language-agnostic way.

So I have a suspicion that Apache Arrow would not support reusing duplicate data while storing it only once. Would anyone mind clarifying this point?
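For what it's worth, the common columnar answer to "store the duplicate once" is dictionary encoding; whether Arrow's layout covers the more general shared-substructure case is exactly the open question here. A rough sketch in plain Java (not Arrow's format; DictionaryEncodedColumn is a hypothetical name):

    // Dictionary encoding stores each distinct value exactly once and keeps a
    // small integer index per row, so the serialized form is just two flat
    // arrays with no pointers or memory addresses in it.
    class DictionaryEncodedColumn {
        String[] dictionary;   // distinct values, each stored once
        int[] indices;         // one index per row

        String get(int row) {
            return dictionary[indices[row]];
        }
    }

    // Example: the column ["US", "DE", "US", "US", "FR"] becomes
    //   dictionary = ["US", "DE", "FR"]
    //   indices    = [0, 1, 0, 0, 2]

That handles repeated values within a column, but not arbitrary shared sub-objects the way an in-memory pointer graph does.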
"Modern CPUs are designed to exploit data-level parallelism via vectorized operations and SIMD instructions. Arrow facilitates such processing."<p>How will Arrow use vectorized instructions on the JVM? That seems to be only available to the JIT and JNI, which is a frustrating limitation.
More technical details available here:

http://blog.cloudera.com/blog/2016/02/introducing-apache-arrow-a-fast-interoperable-in-memory-columnar-data-structure-standard/