科技回声

1 comment

eliomattia, about 2 years ago

I just found this blog post. It seems Palantir Foundry, which does not come up often when researching Git-for-data tools, includes a version control system for datasets that stores diffs in their own cloud-based filesystem. According to the author, one of the founding engineers of the platform, diffs are:

> particularly useful for append-only datasets of immutable records such as system logs or sensor readings which are often among the largest (and fastest-growing) datasets our customers use

Diffs seem to consist of additional files in separate folders:

> behind the scenes we effectively store each diff in a separate folder in the backing file system (e.g., datasetA/diff1, datasetA/diff2, …) so that the whole dataset is simply represented by datasetA/*.

Without exposing technicalities, the author suggests that the delete use case is handled logically rather than physically, since datasetA/* may not reflect the actual whole dataset. I infer that they might be logging changes under the hood in a Git-like fashion.

> It’s a bit more complicated than this because users can selectively delete files from those diffs

However, it seems that the raw versioning data they manage is not available to clients or users directly:

> a simple request that we frequently get from our customers: “can we export our datasets from Palantir Foundry to our existing data lake or S3 bucket?” While this is of course possible, it is important to understand that such exported datasets lack precisely those versioning and sandboxing features that make Foundry a great tool for collaborative data engineering.

This could be a mechanism for vendor lock-in, tied to the very important ACID guarantees of their implementation.

I came across their post while doing research on existing solutions for dataset versioning. Some extra background here: https://news.ycombinator.com/item?id=35930895
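
To make the inference above concrete, here is a minimal sketch of how diff folders plus logical deletes could resolve into a dataset view. It is purely illustrative and is not Foundry's implementation: the `Diff` class, the `resolve_view` function, and the file names are all hypothetical, and the folders are modeled as in-memory dictionaries rather than a real filesystem.

```python
from dataclasses import dataclass, field


@dataclass
class Diff:
    """One hypothetical diff: files it adds and earlier files it logically deletes."""
    name: str
    added: dict[str, list[str]] = field(default_factory=dict)  # file name -> records
    deleted: set[str] = field(default_factory=set)  # paths such as "diff1/a.log"


def resolve_view(diffs: list[Diff]) -> dict[str, list[str]]:
    """Replay diffs in order and return the files that are still live."""
    live: dict[str, list[str]] = {}
    for diff in diffs:
        # A delete only hides the file from the view; the older diff is untouched.
        for path in diff.deleted:
            live.pop(path, None)
        for file_name, records in diff.added.items():
            live[f"{diff.name}/{file_name}"] = records
    return live


diffs = [
    Diff("diff1", added={"a.log": ["r1", "r2"]}),
    Diff("diff2", added={"b.log": ["r3"]}),
    Diff("diff3", deleted={"diff1/a.log"}, added={"c.log": ["r4"]}),
]

view = resolve_view(diffs)
records = [r for path in sorted(view) for r in view[path]]
print(sorted(view))  # ['diff2/b.log', 'diff3/c.log']
print(records)       # ['r3', 'r4']
```

In this model a plain listing of every diff folder would still include `diff1/a.log`, which is why datasetA/* alone would not reflect the actual dataset. Replaying only a prefix, for example `resolve_view(diffs[:2])`, yields the view as of an earlier version.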

Palantir Foundry's dataset version control, a diff-based Git for data
