Substrait seems like the biggest/most-used competitor-ish project out there. I'd love some compare & contrast; my sense is that Substrait has a smaller ambition: it wants to be a language for describing execution rather than a full-on optimization/execution engine. <a href="https://github.com/substrait-io/substrait">https://github.com/substrait-io/substrait</a> .<p>(Edit: ah, there's a recent talk discussing PyVelox trying to get Substrait integration. <a href="https://www.youtube.com/watch?v=l_kHxkGkNRg#t=18m22s" rel="nofollow">https://www.youtube.com/watch?v=l_kHxkGkNRg#t=18m22s</a> . However, there's also discussion here about how unmaintained some of the current Substrait work is; the status is unclear. <a href="https://github.com/facebookincubator/velox/issues/8895">https://github.com/facebookincubator/velox/issues/8895</a>)<p>We can also see from the Apache Arrow DataFusion discussion that they too see themselves as a bit of a Velox competitor. <a href="https://github.com/apache/arrow-datafusion/discussions/6441">https://github.com/apache/arrow-datafusion/discussions/6441</a><p>It's cool to see this space mature. I like that even Velox sees that Apache Arrow (which also underlies Apache Arrow DataFusion) is industry-standard tech that they ought to work with. <a href="https://engineering.fb.com/2024/02/20/developer-tools/velox-apache-arrow-15-composable-data-management/" rel="nofollow">https://engineering.fb.com/2024/02/20/developer-tools/velox-...</a><p>There's a solid Influx post that speaks to how they are composing the assorted technologies to build their next-gen 3.0, which I find helpful for getting a sense of how all the pieces of a modern high-performance data engine slot together. <a href="https://www.influxdata.com/blog/flight-datafusion-arrow-parquet-fdap-architecture-influxdb/" rel="nofollow">https://www.influxdata.com/blog/flight-datafusion-arrow-parq...</a>
My general take is that while the idea of composability is good, the implementations of these things are just frankly not of high quality. Velox and Acero in particular are both plagued by what I've come to call "Java syndrome", where everything is written as idiomatic Java but with C++ syntax: virtual methods, std::shared_ptr galore (in lieu of garbage collection), random heap allocations, etc. As a result these systems tend to be bloated and significantly slower than they need to be.<p>DuckDB is good though, and I predict its quality of implementation will keep "monolithic databases" relevant for a while longer.
Velox could be a competitor to DataFusion. It is more focused on being an execution engine and could be great to integrate into other high-performance databases.<p>Databases will be split into pieces and rebuilt!
I wonder how many of these sorts of FAANG projects really get used where they are built. I went for an interview at a FAANG years ago to work on a very big consumer product (when it was in relative infancy) and expected to find a hyper-tech data backend to use... they told me that they were using MySQL.<p>I didn't get the job so maybe they were just joking around with me - but the general despair that they evinced about their data situation makes me wonder!
A thread from late 2022: <a href="https://news.ycombinator.com/item?id=32673873">https://news.ycombinator.com/item?id=32673873</a>
To the best of my knowledge, Meta has significantly reduced its investment in the Velox project. Apart from Meta, I'm not aware of any other major company that really uses Velox in a production environment. Frankly speaking, Velox may have already missed the window of opportunity for rapid development. If you're looking for a vectorized execution engine, you could consider ClickHouse (www.clickhouse.com) or StarRocks (www.starrocks.io). If your data analysis scenarios require more multi-table join operations, StarRocks is clearly a better choice.
Many ideas look like they were influenced by ClickHouse, and some are direct copies. I'm surprised they didn't provide references to ClickHouse, where the implementations are proven in production in the first place.