Apache Arrow: A new open source in-memory columnar data format

170 points by jkestelyn over 9 years ago

12 comments

bcoates over 9 years ago
If you don't speak press-release, this is a cool project to create an in-memory interop format for columnar data, so various tools can share data without a serialize/deserialize step which is very expensive compared to copying or sharing memory on the same machine or within a local network.

https://git-wip-us.apache.org/repos/asf?p=arrow.git;a=blob;f=format/Layout.md;h=c393163bf894bab283641882d9aa4a8c2ef0ef8e;hb=HEAD

(edited post because I fail reading git and didn't notice the java implementation)
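A minimal sketch of the difference described above, using only Python's standard library rather than Arrow itself. The `prices` column is hypothetical, and `pickle` stands in for any serialize/deserialize step; the point is only that sharing a buffer avoids building a second copy.

```python
import pickle
from array import array

# A hypothetical column of 64-bit integers held in one contiguous buffer.
prices = array("q", [10, 20, 30, 40])

# Path 1: serialize, then deserialize. The consumer ends up with its own copy.
copied = pickle.loads(pickle.dumps(prices))

# Path 2: share the buffer. memoryview exposes the same bytes without copying.
shared = memoryview(prices)

# A write by the producer is visible through the shared view but not in the copy.
prices[0] = 99
shared_first = shared[0]
copied_first = copied[0]
```

Sharing only works if both sides agree on the layout of the bytes, which is the part an interop format has to standardize.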
xtacy over 9 years ago
Nice initiative. Cheap serde and cross-language compatibility with an eye towards data scan intensive workloads is an important component!

Have you folks considered Supersonic engine from Google, which was designed with similar (but not as extensive as Arrow) goals in mind?

https://github.com/google/supersonic
rch over 9 years ago
In-memory only... How is this better than the SFrame implementation from Dato (2015) that was posted here a couple of days ago?

https://news.ycombinator.com/item?id=11106501
rodionos over 9 years ago
I had a hard time parsing out the technical bits behind the facade of titles and endorsements. I didn't know you could be a Vice President of an open source product. Two roles I'm familiar with: committer and PMC chair. Now VPs. Just saying...
agentgt over 9 years ago
Not to completely denigrate Apache but I have found contributing to Apache projects (not the license but real Apache projects) to be somewhat of a hassle. They generally do not do PRs but rather patches via email or attached to bugs (perhaps some do but I have yet to see one that does), require signatures for contribution, and JIRA is really getting slow these days.

I'm somewhat ignorant as I don't run any Apache projects but I'm curious as to why people choose Apache to back their project these days. I guess the question is why choose a committee instead of just leaving it on GitHub. I suppose it's the whole voting and board stuff.
ccleve over 9 years ago
Could someone explain the difference between this and Avro or Parquet? Do they serve the same purpose?
buremba over 9 years ago
Nice to see a new columnar data format alternative. Just a quick question though.

The existing columnar data formats such as Parquet and ORC aim to be space-efficient since the data is stored on disk and IO operations are usually the bottleneck. The columnar data formats shine in the big-data area, so the amount of data will be huge. Given that columnar data formats can be compressed efficiently and that's one of the main points of columnar data formats such as Parquet and ORC, I'm not sure that I understand the main point of in-memory columnar data formats.

Once the data is in-memory and we can access any column of a row in constant time, what's the difference between a row-oriented data format and a columnar data format?
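One way to look at the closing question is to compare where the values of a single column sit in each layout. The sketch below is an illustration under assumptions, not Arrow's actual format: the three-field table and its field names are hypothetical, and it uses only Python's `struct` and `array` modules.

```python
import struct
from array import array

# Hypothetical table: (id: int64, price: float64, qty: int64).
rows = [(1, 9.5, 3), (2, 20.0, 1), (3, 5.25, 7)]

# Row-oriented: each record's fields are packed next to each other.
row_format = struct.Struct("<qdq")
row_buffer = b"".join(row_format.pack(*r) for r in rows)

# Column-oriented: each field's values are packed next to each other.
ids = array("q", (r[0] for r in rows))
prices = array("d", (r[1] for r in rows))
qtys = array("q", (r[2] for r in rows))

# Byte distance between consecutive price values in each layout.
row_stride = row_format.size      # a scan steps over id and qty every time
column_stride = prices.itemsize   # a scan reads one dense run of doubles

# Summing one column: the row layout walks every full record,
# the columnar layout touches only the bytes that hold prices.
total_from_rows = sum(price for _, price, _ in row_format.iter_unpack(row_buffer))
total_from_column = sum(prices)
```

Both layouts give constant-time access to a single cell. The difference shows up when scanning one column across many rows: the columnar buffer is contiguous, which is friendlier to CPU caches and to vectorized instructions.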
david-given over 9 years ago
Is it streamable? Could I use this as an intermediate format to send columnar data between two processes via a pipe?
tycho01 about 9 years ago
I'm interpreting this as saying they want to use the same representation in memory (for querying) and for 'serialization' (sending the same thing over the wire). This raises the question of why separate serialized representations ever became a thing in the first place.

My understanding is that serialization became a thing because in-memory representations tend to use pointers to shared data structures that may thus be referenced multiple times while being stored only once. This would not translate 1:1 to serialized representations (where memory offsets would no longer hold meaning) -- much less in any language-agnostic way.

So I have this suspicion that Apache Arrow would not support reusing duplicate data while storing it only once. Would anyone mind clarifying this point?
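For what it's worth, a pointer-free layout can still store a repeated value once by referring to it with an index instead of an address. The following is a generic sketch of that idea (often called dictionary encoding), not a description of Arrow's actual layout; the `cities` column is hypothetical, and both sides are assumed to use the same byte order and integer width.

```python
from array import array

# Hypothetical column with repeated values.
cities = ["Oslo", "Lima", "Oslo", "Oslo", "Lima"]

# Store each distinct value once, in first-seen order.
dictionary = list(dict.fromkeys(cities))
position = {value: i for i, value in enumerate(dictionary)}

# The column itself is small integer indices, not pointers.
indices = array("i", (position[c] for c in cities))

# Indices are relative to the dictionary, so a byte-for-byte copy
# handed to another process decodes to the same values.
wire_bytes = indices.tobytes()
received = array("i")
received.frombytes(wire_bytes)
decoded = [dictionary[i] for i in received]
```

Because an index means the same thing wherever the bytes end up, the in-memory form and the on-the-wire form can be identical.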
NovaX over 9 years ago
"Modern CPUs are designed to exploit data-level parallelism via vectorized operations and SIMD instructions. Arrow facilitates such processing."

How will Arrow use vectorized instructions on the JVM? That seems to be only available to the JIT and JNI, which is a frustrating limitation.
crudbug over 9 years ago
"All systems utilize the same memory format"

Will Cassandra Java drivers support this?
jkestelyn over 9 years ago
More technical details available here:

http://blog.cloudera.com/blog/2016/02/introducing-apache-arrow-a-fast-interoperable-in-memory-columnar-data-structure-standard/