TechEcho

12 comments

bcoates over 9 years ago

If you don't speak press-release: this is a cool project to create an in-memory interop format for columnar data, so that various tools can share data without a serialize/deserialize step, which is very expensive compared to copying or sharing memory on the same machine or within a local network.

https://git-wip-us.apache.org/repos/asf?p=arrow.git;a=blob;f=format/Layout.md;h=c393163bf894bab283641882d9aa4a8c2ef0ef8e;hb=HEAD

(Edited post because I fail at reading git and didn't notice the Java implementation.)

xtacy over 9 years ago

Nice initiative. Cheap serde and cross-language compatibility, with an eye towards scan-intensive workloads, is an important component!

Have you folks considered the Supersonic engine from Google, which was designed with similar (but not as extensive as Arrow's) goals in mind?

https://github.com/google/supersonic

rch over 9 years ago

In-memory only... How is this better than the SFrame implementation from Dato (2015) that was posted here a couple of days ago?

https://news.ycombinator.com/item?id=11106501

rodionos over 9 years ago

I had a hard time parsing out the technical bits behind the facade of titles and endorsements. I didn't know you could be a Vice President of an open source product. Two roles I'm familiar with: committer and PMC chair. Now VPs. Just saying...

agentgt over 9 years ago

Not to completely denigrate Apache, but I have found contributing to Apache projects (not the license but real Apache projects) to be somewhat of a hassle. They generally do not do PRs but rather patches via email or attached to bugs (perhaps some do, but I have yet to see one that does), require signatures for contribution, and JIRA is really getting slow these days.

I'm somewhat ignorant as I don't run any Apache projects, but I'm curious as to why people choose Apache to back their project these days. I guess: why choose a committee instead of just leaving it on GitHub? I suppose it's the whole voting and board stuff.

ccleve over 9 years ago

Could someone explain the difference between this and Avro or Parquet? Do they serve the same purpose?

buremba over 9 years ago

Nice to see a new columnar data format alternative. Just a quick question though.

The existing columnar data formats such as Parquet and ORC aim to be space-efficient, since the data is stored on disk and IO operations are usually the bottleneck. Columnar data formats shine in the big-data area, so the amount of data will be huge. Given that columnar data can be compressed efficiently, and that this is one of the main points of formats such as Parquet and ORC, I'm not sure that I understand the main point of in-memory columnar data formats.

Once the data is in memory and we can access any column of a row in constant time, what's the difference between a row-oriented data format and a columnar data format?
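
To make the two layouts in this question concrete, here is a minimal sketch in Python (standard library only). It is not Arrow's actual layout; the table, the column names `id` and `price`, and both helper functions are hypothetical and exist only for illustration. It shows that a row-oriented buffer interleaves fields, so scanning one column means stepping over the others, while a columnar buffer keeps each column's values contiguous.

```python
from array import array
import struct

# Hypothetical table with two columns: id (int32) and price (float64).
rows = [(1, 9.5), (2, 3.25), (3, 7.0)]

# Row-oriented: each record's fields are packed next to each other.
# "<id" = little-endian, 4-byte int followed by 8-byte double, no padding.
row_fmt = struct.Struct("<id")
row_buf = b"".join(row_fmt.pack(i, p) for i, p in rows)

# Columnar: each column is one contiguous, single-typed buffer.
id_col = array("i", (i for i, _ in rows))
price_col = array("d", (p for _, p in rows))


def sum_price_rows(buf):
    """Sum the price field by decoding every record, ids included."""
    return sum(p for _, p in row_fmt.iter_unpack(buf))


def sum_price_columnar(col):
    """Sum the price column by scanning one dense buffer."""
    return sum(col)


if __name__ == "__main__":
    print(sum_price_rows(row_buf), sum_price_columnar(price_col))
```

Both functions return the same total; the difference is in how much unrelated data each scan has to touch. Random access to a single cell is constant-time in either layout, so the layouts differ mainly for scans and aggregates over a few columns.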

david-given over 9 years ago

Is it streamable? Could I use this as an intermediate format to send columnar data between two processes via a pipe?

tycho01 about 9 years ago

I'm interpreting this as saying they want to use the same representation in memory (for querying) and for 'serialization' (sending the same thing over the wire). This raises the question of why separate serialized representations ever became a thing in the first place.

My understanding is that serialization became a thing because in-memory representations tend to use pointers to shared data structures, which may thus be referenced multiple times while being stored only once. This would not translate 1:1 to serialized representations (where memory offsets would no longer hold meaning) -- much less in any language-agnostic way.

So I have this suspicion that Apache Arrow would not support reusing duplicate data while storing it only once. Would anyone mind clarifying this point?
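
One general way around the pointer problem described above is to store relative offsets and integer indices instead of addresses, so the bytes mean the same thing wherever they are copied. The sketch below illustrates that idea in plain Python. It is an assumption-laden illustration, not Arrow's specification or API: the function names `encode_strings`, `get_string`, and `dictionary_encode` are hypothetical. Repeated values are stored once in a dictionary and referenced by index, which is one way duplicate data can be reused without pointers.

```python
from array import array


def encode_strings(values):
    """Pack strings into one data buffer plus a buffer of end offsets."""
    offsets = array("i", [0])
    data = bytearray()
    for v in values:
        data += v.encode("utf-8")
        offsets.append(len(data))
    return offsets, bytes(data)


def get_string(offsets, data, i):
    """Read string i using offsets relative to the start of the buffer."""
    return data[offsets[i]:offsets[i + 1]].decode("utf-8")


def dictionary_encode(values):
    """Store each distinct value once; rows refer to it by integer index."""
    dictionary, index_of = [], {}
    indices = array("i")
    for v in values:
        if v not in index_of:
            index_of[v] = len(dictionary)
            dictionary.append(v)
        indices.append(index_of[v])
    return dictionary, indices


values = ["nyc", "sf", "nyc", "nyc"]
dictionary, indices = dictionary_encode(values)
offsets, data = encode_strings(dictionary)

# Simulate handing the raw bytes to another process: copy the buffers
# and rebuild typed views. No pointer fix-up is needed.
wire_offsets = array("i")
wire_offsets.frombytes(offsets.tobytes())
wire_data = bytes(data)
decoded = [get_string(wire_offsets, wire_data, i) for i in indices]

if __name__ == "__main__":
    print(decoded)
```

Because every reference is an index or an offset from the start of a buffer, the copied bytes decode to the original values without any translation step.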

NovaX over 9 years ago

"Modern CPUs are designed to exploit data-level parallelism via vectorized operations and SIMD instructions. Arrow facilitates such processing."

How will Arrow use vectorized instructions on the JVM? That seems to be only available to the JIT and JNI, which is a frustrating limitation.

crudbug over 9 years ago

"All systems utilize the same memory format"

Will Cassandra Java drivers support this?

jkestelyn over 9 years ago

More technical details are available here:

http://blog.cloudera.com/blog/2016/02/introducing-apache-arrow-a-fast-interoperable-in-memory-columnar-data-structure-standard/

Apache Arrow: A new open source in-memory columnar data format