I learned that "Nimble" is the new name for "Alpha", discussed in this 2023 report: <a href="https://www.cidrdb.org/cidr2023/papers/p77-chattopadhyay.pdf" rel="nofollow">https://www.cidrdb.org/cidr2023/papers/p77-chattopadhyay.pdf</a><p>Here's an excerpt that may save some folks a click or three…<p>> <i>"While storing analytical and ML tables together in the data lakehouse is beneficial from a management and integration perspective, it also imposes some unique challenges. For example, it is increasingly common for ML tables to outgrow analytical tables by up to an order of magnitude. ML tables are also typically much wider, and tend to have tens of thousands of features usually stored as large maps.</i><p>> <i>"As we executed on our codec convergence strategy for ORC, it gradually exposed significant weaknesses in the ORC format itself, especially for ML use cases. The most pressing issue with the DWRF format was metadata overhead; our ML use cases needed a very large number of features (typically stored as giant maps), and the DWRF map format, albeit optimized, had too much metadata overhead. Apart from this, DWRF had several other limitations related to encodings and stripe structure, which were very difficult to fix in a backward-compatible way. Therefore, we decided to build a new columnar file format that addresses the needs of the next generation data stack; specifically, one that is targeted from the onset towards ML use cases, but without sacrificing any of the
analytical needs.</i><p>> <i>"The result was a new format we call Alpha. Alpha has several notable characteristics that make it particularly suitable for mixed Analytical and ML training use cases. It has a custom serialization format for metadata that is significantly faster to decode, especially for very wide tables and deep maps, in addition to more modern compression algorithms. It also provides a richer set of encodings and an adaptive encoding algorithm that can smartly pick the best encoding based on historical data patterns, through an encoding history loopback database. Alpha requires fewer streams per column for many common data types, making read coalescing much easier and saving I/Os, especially for HDDs. Alpha was written in modern C++ from scratch in a way that allows it to be extended easily in the future.</i><p>> <i>"Alpha is being deployed in production today for several important ML training applications and showing 2-3x better performance than ORC on decoding, with comparable encoding performance and file size."</i>
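<p>For anyone curious what "an adaptive encoding algorithm that can smartly pick the best encoding based on historical data patterns" might mean in practice, here's a toy sketch. This is NOT Nimble/Alpha's actual algorithm (the paper doesn't spell it out); the heuristics, thresholds, and the `history` parameter are all my own illustration of the general idea: compute cheap statistics over a column chunk, shortlist candidate encodings, and break ties toward whatever has worked before.<p>

```python
def pick_encoding(values, history=None):
    """Toy encoding selector (illustrative only, not Nimble's algorithm).

    Returns 'rle', 'dictionary', or 'plain' for a column chunk.
    `history` maps encoding name -> past win count; ties break toward
    the historically successful encoding, a stand-in for the paper's
    "encoding history" idea.
    """
    n = len(values)
    if n == 0:
        return "plain"

    # Cheap statistics over the chunk.
    runs = 1 + sum(1 for a, b in zip(values, values[1:]) if a != b)
    distinct = len(set(values))

    # Shortlist candidates via simple (made-up) thresholds.
    candidates = []
    if runs / n < 0.1:        # long runs -> run-length encoding
        candidates.append("rle")
    if distinct / n < 0.2:    # low cardinality -> dictionary encoding
        candidates.append("dictionary")
    if not candidates:
        candidates.append("plain")

    # Prefer whichever candidate has won most often in the past.
    if history:
        candidates.sort(key=lambda e: -history.get(e, 0))
    return candidates[0]

print(pick_encoding([7] * 100))            # rle (one long run)
print(pick_encoding(list(range(5)) * 40))  # dictionary (5 distinct values)
print(pick_encoding(list(range(100))))     # plain (all unique, no runs)
```

<p>A real format would of course measure encoded sizes and decode cost rather than eyeball ratios, and the paper describes the history as a separate loopback database rather than an in-process dict, but the shape of the decision is the same.<p>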