I learned that "Nimble" is the new name for "Alpha", discussed in this 2023 report: <a href="https://www.cidrdb.org/cidr2023/papers/p77-chattopadhyay.pdf" rel="nofollow">https://www.cidrdb.org/cidr2023/papers/p77-chattopadhyay.pdf</a><p>Here's an excerpt that may save some folks a click or three…<p>> <i>"While storing analytical and ML tables together in the data lakehouse is beneficial from a management and integration perspective, it also imposes some unique challenges. For example, it is increasingly common for ML tables to outgrow analytical tables by up to an order of magnitude. ML tables are also typically much wider, and tend to have tens of thousands of features usually stored as large maps.</i><p>> <i>"As we executed on our codec convergence strategy for ORC, it gradually exposed significant weaknesses in the ORC format itself, especially for ML use cases. The most pressing issue with the DWRF format was metadata overhead; our ML use cases needed a very large number of features (typically stored as giant maps), and the DWRF map format, albeit optimized, had too much metadata overhead. Apart from this, DWRF had several other limitations related to encodings and stripe structure, which were very difficult to fix in a backward-compatible way. Therefore, we decided to build a new columnar file format that addresses the needs of the next generation data stack; specifically, one that is targeted from the onset towards ML use cases, but without sacrificing any of the
analytical needs.</i><p>> <i>"The result was a new format we call Alpha. Alpha has several notable characteristics that make it particularly suitable for mixed Analytical and ML training use cases. It has a custom serialization format for metadata that is significantly faster to decode, especially for very wide tables and deep maps, in addition to more modern compression algorithms. It also provides a richer set of encodings and an adaptive encoding algorithm that can smartly pick the best encoding based on historical data patterns, through an encoding history loopback database. Alpha requires fewer streams per column for many common data types, making read coalescing much easier and saving I/Os, especially for HDDs. Alpha was written in modern C++ from scratch in a way that allows it to be extended easily in the future.</i><p>> <i>"Alpha is being deployed in production today for several important ML training applications and showing 2-3x better performance than ORC on decoding, with comparable encoding performance and file size."</i>
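<p>For anyone curious what "an adaptive encoding algorithm that can smartly pick the best encoding based on historical data patterns" might mean in practice, here's a toy sketch. This is NOT Nimble/Alpha's actual algorithm (the paper doesn't spell it out); the heuristics, thresholds, and the `history` parameter are all my own illustration of the general idea: compute cheap statistics over a column chunk, shortlist candidate encodings, and break ties toward whatever has worked before.<p>

```python
def pick_encoding(values, history=None):
    """Toy encoding selector (illustrative only, not Nimble's algorithm).

    Returns 'rle', 'dictionary', or 'plain' for a column chunk.
    `history` maps encoding name -> past win count; ties break toward
    the historically successful encoding, a stand-in for the paper's
    "encoding history" idea.
    """
    n = len(values)
    if n == 0:
        return "plain"

    # Cheap statistics over the chunk.
    runs = 1 + sum(1 for a, b in zip(values, values[1:]) if a != b)
    distinct = len(set(values))

    # Shortlist candidates via simple (made-up) thresholds.
    candidates = []
    if runs / n < 0.1:        # long runs -> run-length encoding
        candidates.append("rle")
    if distinct / n < 0.2:    # low cardinality -> dictionary encoding
        candidates.append("dictionary")
    if not candidates:
        candidates.append("plain")

    # Prefer whichever candidate has won most often in the past.
    if history:
        candidates.sort(key=lambda e: -history.get(e, 0))
    return candidates[0]

print(pick_encoding([7] * 100))            # rle (one long run)
print(pick_encoding(list(range(5)) * 40))  # dictionary (5 distinct values)
print(pick_encoding(list(range(100))))     # plain (all unique, no runs)
```

<p>A real format would of course measure encoded sizes and decode cost rather than eyeball ratios, and the paper describes the history as a separate loopback database rather than an in-process dict, but the shape of the decision is the same.<p>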