One of the lead Arrow developers here (<a href="https://github.com/wesm" rel="nofollow">https://github.com/wesm</a>). It's a little disappointing for me to see the Arrow project scrutinized through the one-dimensional lens of columnar storage for database systems -- that is, treating Arrow as an alternative technology to Parquet and ORC (part of the same category of technologies).<p>The reality (at least from my perspective, which is more informed by the data science and non-analytic database world) is that Arrow is a new, category-defining technology. You could choose to use it as a main-memory storage format as an alternative to Parquet/ORC if you wanted, but that would be one possible use case for the technology, not the primary one.<p>What's missing from the article is the role of, and relationship between, runtime memory formats and storage formats, and the costs of serialization and data interchange, particularly between processes and analytics runtimes. There are also some small factual errors about Arrow in the article (for example, Arrow record batches are not limited to 64K in length).<p>I will try to write a lengthier blog post on wesmckinney.com when I can, going deeper into these topics to provide some color for onlookers.
I'm also a developer on Arrow (<a href="https://github.com/jacques-n/" rel="nofollow">https://github.com/jacques-n/</a>), like WesM. It is always rewarding (and sometimes challenging) to hear how people understand or value something you're working on.<p>I think Dan's analysis evaluates Arrow from one particular and fairly constrained perspective: "if using Arrow and Parquet for RDBMS purposes, should they exist separately?" I'm glad that Dan comes to a supportive conclusion even with a pretty narrow set of criteria.<p>If you broaden the criteria to all the different reasons people are consuming/leveraging/contributing to Arrow, the case for its existence and use only becomes clearer. As someone who uses Arrow extensively in my own work and professionally (<a href="https://github.com/dremio/dremio-oss" rel="nofollow">https://github.com/dremio/dremio-oss</a>), I find many benefits, including two biggies: processing speed AND interoperability (two different apps can now share in-memory data without serialization/deserialization or a duplicate memory footprint). Best of all, the community is composed of collaborators trying to solve similar problems. When you combine all of these, Arrow is a no-brainer as an independent community, and it is developing quickly because of that (80+ contributors, many language bindings (6+), and more than 1300 GitHub stars in a short amount of time).
The Apache Software Foundation has no problem with hosting competing projects. <i>There is no top-level technical direction by the Board or any other entity.</i> There's no "steering committee" or the like where companies pay to play and push the organization to pump up their preferred project.<p>This is one of the fundamental reasons that the ASF is successful. Any time you see the ASF criticized for hosting competing projects without addressing this point, feel free to dismiss the critique as facile and uninformed.
This point is interesting:<p><i>So indeed, it does not matter whether the data is stored on disk or in memory --- column-stores are a win for these types of workloads. However, the reason is totally different.</i><p>He says that for tables on disk, column layout is a win due to less data transferred from disk. But for tables in memory, the win is due to the fact that you can vectorize operations on adjacent values.<p>I also think the convergence of these two approaches is interesting: one comes from a database POV; the other from a programming language POV.<p>Column databases came out of RDBMS research. They want to speed up SQL-like queries.<p>Apache Arrow / Pandas came at it from the point of view of R, which descended from S [1]. It's basically an algebra for dealing with scientific data in an interactive programming language. In this view, rows are observations and columns are variables.<p>It does seem to make sense for there to be two separate projects now, but I wonder if eventually they will converge. Probably one big difference is what write operations are supported, if any. R, and I think Pandas, do let you mutate a value in the middle of a column.<p>-----<p>On another note, I have been looking at how to implement data frames. I believe the idea came from S, which dates back to 1976, but I only know of a few open source implementations:<p>- R itself -- the code is known to be pretty old and slow, although rich with functionality.<p>- Hadley Wickham's dplyr (its tbl type is an enhanced data.frame). In C++.<p>- The data.table package in R. In C.<p>- Pandas. In a mix of Python, Cython, and C.<p>- Apache Arrow. In C++.<p>- Julia's DataFrames.jl.<p>If anyone knows of others, I'm interested.<p>[1] <a href="https://en.wikipedia.org/wiki/S_(programming_language)" rel="nofollow">https://en.wikipedia.org/wiki/S_(programming_language)</a>
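To make the in-memory point concrete, here's a minimal C++ sketch (a hypothetical 6-column table; struct and function names invented for illustration): with a row layout the scan strides over unrelated fields, while a column layout is a plain contiguous loop the compiler can auto-vectorize at -O3.

    #include <cstdint>
    #include <vector>

    // Row layout: one struct per observation ("array of structs").
    struct Row { int32_t a, b, c, d, e, f; };

    // Count rows where column `a` equals `key`, scanning the row layout.
    // The 24-byte stride typically defeats auto-vectorization.
    int64_t count_rows(const std::vector<Row>& rows, int32_t key) {
        int64_t hits = 0;
        for (const Row& r : rows) hits += (r.a == key);
        return hits;
    }

    // Column layout ("struct of arrays"): column `a` is contiguous, so the
    // compiler can emit SIMD compares for this loop on its own at -O3.
    int64_t count_column(const std::vector<int32_t>& a, int32_t key) {
        int64_t hits = 0;
        for (int32_t v : a) hits += (v == key);
        return hits;
    }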
A pretty glaring flaw or omission in the analysis is using a table that's just 6 columns wide. Tables used in data analytics workloads are much wider; 50-100 columns or more is common. With that many columns, row-oriented storage has to scan significantly more data.
I would argue that the fact that Arrow has been integrated into so many projects over the last year is proof that a separate project made sense. Dremio, PySpark, Pandas, MapD, NVIDIA GPU Data Frame, Graphistry, Turbodbc, ...
FWIW, it should be possible to vectorize the search across the row store. With 24-byte tuples (assuming no inter-row padding) you can fit roughly 2.7 rows into a 64-byte cache line (a 512-bit SIMD register). Then it's just a matter of proper bit masking and comparison. Easier said than done, I figure, because that remainder is going to be a pain. Another approach is to use gather instructions to load the first column of each row, effectively packing that column into a register as if it had been loaded from a column store, and then do a vectorized compare as in the column-store case.<p>All of that to underscore that it's not that one format vectorizes and the other doesn't. The key takeaway here is that with the column store, the compiler can <i>automatically</i> vectorize. This is especially a bonus for JVM-based languages because afaik there is no decent way to hand-roll SIMD code without crossing a JNI boundary.
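Roughly what the gather approach could look like with AVX2 intrinsics -- a sketch only, assuming 24-byte rows whose first field is a 4-byte int, n a multiple of 8, and ignoring that gathers are fairly slow on many microarchitectures (function name invented):

    #include <cstdint>
    #include <immintrin.h>

    // Count matches on the first 4-byte column of 24-byte rows using AVX2 gathers.
    int64_t count_first_col_avx2(const int32_t* base, int64_t n, int32_t key) {
        // Row stride is 24 bytes = 6 int32 slots; the gather scale is 4 bytes.
        const __m256i offsets = _mm256_setr_epi32(0, 6, 12, 18, 24, 30, 36, 42);
        const __m256i needle  = _mm256_set1_epi32(key);
        int64_t hits = 0;
        for (int64_t i = 0; i < n; i += 8) {
            // Pack the first column of 8 consecutive rows into one register.
            __m256i col = _mm256_i32gather_epi32(base + i * 6, offsets, 4);
            __m256i eq  = _mm256_cmpeq_epi32(col, needle);
            hits += _mm_popcnt_u32(_mm256_movemask_ps(_mm256_castsi256_ps(eq)));
        }
        return hits;
    }

(Built with something like -mavx2 -mpopcnt, or -march=native.)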
I've built a columnar in-memory data transformation engine [1] and wrote [2] about the need for a common columnar data format to avoid re-compression. The problem I have with the Apache projects is that they all require strongly typed columns (if I'm not missing something). In our case, the app is actively used for processing spreadsheets and sometimes XML files, which requires supporting mixed data types (e.g. timestamps and strings) in a column.<p>Another issue is storing 128-bit decimals. They are preferable for financial calculations, but are not supported in the Apache projects.<p>So maybe we need a fourth standard for columnar data representation. Or expand the existing ones to make them more versatile.<p>[1] <a href="http://easymorph.com/in-memory-engine.html" rel="nofollow">http://easymorph.com/in-memory-engine.html</a><p>[2] <a href="http://bi-review.blogspot.ca/2015/06/the-world-needs-open-source-columnar.html" rel="nofollow">http://bi-review.blogspot.ca/2015/06/the-world-needs-open-so...</a>
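For what it's worth, one way to layer a mixed-type column on top of typed storage is a per-row tag plus per-type value arrays -- a hypothetical C++ sketch, not how any of the Apache formats actually model this:

    #include <cstdint>
    #include <string>
    #include <vector>

    // A column that mixes timestamps and strings: each row carries a type tag
    // and an index into the matching typed array below.
    struct MixedColumn {
        enum class Tag : uint8_t { Timestamp, Text };
        std::vector<Tag>         tags;         // one tag per row
        std::vector<uint32_t>    value_index;  // index into timestamps or texts
        std::vector<int64_t>     timestamps;   // e.g. microseconds since epoch
        std::vector<std::string> texts;
    };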
It's an interesting article, but surely a modern CPU should be able to manage more than 4 u32 compares per cycle? Any machine with AVX2 (or beyond) should be able to do two 256-bit VPCMPEQD instructions per cycle (or one 512-bit with AVX-512). The code to marshal up the results of this compare and do something useful with them would push out to another cycle or two, but IMO we can surely manage to compare an entire cache line (64 bytes) in one cycle's worth of work.<p>This doesn't invalidate his point, and there are even more interesting things that can be done with packed data formats -- and of course if you're searching a 6-bit column for some data item (or set of data items) you might be even happier to be using a columnar form (especially if the row contains lots of stuff: the 6:1 ratio in the article might be understating the advantage of working at the column level).
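A sketch of what that looks like for a contiguous u32 column (assuming AVX2 and n a multiple of 16; whether the two compares actually retire every cycle depends on the load ports and on feeding the loop from memory fast enough -- function name invented):

    #include <cstdint>
    #include <immintrin.h>

    // Compare 16 u32 values (one 64-byte cache line) per loop iteration.
    int64_t count_matches(const uint32_t* col, int64_t n, uint32_t key) {
        const __m256i needle = _mm256_set1_epi32((int32_t)key);
        int64_t hits = 0;
        for (int64_t i = 0; i < n; i += 16) {
            __m256i lo = _mm256_loadu_si256((const __m256i*)(col + i));
            __m256i hi = _mm256_loadu_si256((const __m256i*)(col + i + 8));
            int mlo = _mm256_movemask_ps(_mm256_castsi256_ps(_mm256_cmpeq_epi32(lo, needle)));
            int mhi = _mm256_movemask_ps(_mm256_castsi256_ps(_mm256_cmpeq_epi32(hi, needle)));
            hits += _mm_popcnt_u32((unsigned)((mhi << 8) | mlo));
        }
        // At ~3 GHz, 64 bytes of compares per cycle works out to ~192 GB/s, so in
        // practice memory bandwidth, not the comparison itself, becomes the limit.
        return hits;
    }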
I don't have any strong feelings about this either way, but the main question in my head after reading this is:<p>- Are the differences SIGNIFICANT enough to support two (three?) different codebases? A lot of your points seem to be more about building in configurations/plugins versus a completely different (core) storage format -- for instance, adding separate plugins for different dictionary compressions, or being able to specify the block read size or schemes. I would just think you may end up spending a lot of time reinventing 80% of the wheel and only 20% on 'memory-specific' functionality.<p>(I'm naively oversimplifying, but it's something to think about).
If the CPU test is simply reading 1/6 of the data, then memory should be the bottleneck and not the CPU (even unoptimized). Something smells very wrong about his ad hoc test. The next piece of data is in the cache line seven times out of eight, and it's reading only 1/6 of the data into memory. It should be way faster even without -O3. And if there's something fundamentally broken about the code without -O3, then <i>why post that benchmark at all</i>? Seems dishonest.<p>It would be good to post the code when making such claims.
Columnar data representations can be optimized and designed for a wide range of query and data types that have non-aligned constraints. So yes, you can have 3, or 12 for that matter. I think we'll see more & more purpose-specific database technologies as gains from Moore's law and its analogues start slowing down, and some previously niche application areas become high-$ areas with larger user bases.
Not sure if this is a good place to ask, but how do Apache Arrow and Parquet compare to Apache Kudu (<a href="https://kudu.apache.org/" rel="nofollow">https://kudu.apache.org/</a>)? Seems like all three are columnar data solutions, but it's not clear when you'd use one over the other.<p>Kind of surprised the article didn't mention Kudu for that matter.
> However, a modern CPU processor runs at approximately 3 GHz --- in other words they can process around 3 billion instructions a second. So even if the processor is doing a 4-byte integer comparison every single cycle, it is processing no more than 12GB a second<p>This is so laughably simplified.
Does Arrow itself support and optimize for multicore processing? Or is that the responsibility of a higher layer like Dremio? If so, do such layers optimize Arrow query execution to utilize multiple cores as much as possible?
I wonder if one could have an FPGA attached to the CPU, load a piece of code into it for pipelined decompression and processing of a piece of a compressed column store, and then vroom! It would process data really fast.
How can I query Apache Arrow without the entire Hadoop stack? It seems it could be a great in-memory OLAP engine, if only there were an efficient way to slice and dice it.
Well, there is another one here: <a href="https://carbondata.apache.org/" rel="nofollow">https://carbondata.apache.org/</a>
Let me give the benefit of the doubt instead of simply doubting:<p>Can somebody provide a justification for why performance-centric implementation details justify an entirely new project? Couldn't this be done as simply a storage engine? For that matter, couldn't all columnar data stores merely be storage engines?
I think the author comes across as tone-deaf for two reasons:<p>1. Burying the lede by tacking the affirmative conclusion on as the final sentence.<p>2. This quote reads as pompous, not humble:<p>> I assume that the Arrow developers will eventually read my 2006 paper on compression in column-stores and expand their compression options to include other schemes which can be operated on directly (such as run-length-encoding and bit-vector compression). I also expect that they will read the X100 compression paper which includes schemes which can be decompressed using vectorized processing.