I just found this blog post. It seems that Palantir Foundry, which rarely comes up when researching Git-for-data tools, includes a version control system for datasets that stores diffs in its own cloud-based filesystem. According to the author, one of the platform's founding engineers, diffs are:<p>> particularly useful for append-only datasets of immutable records such as system logs or sensor readings which are often among the largest (and fastest-growing) datasets our customers use<p>Diffs seem to consist of additional files in separate folders:<p>> behind the scenes we effectively store each diff in a separate folder in the backing file system (e.g.,datasetA/diff1, datasetA/diff2, …) so that the whole dataset is simply represented by datasetA/*.<p>Without going into technical detail, the author suggests that deletes are handled logically rather than physically, which means datasetA/* may not reflect the actual dataset:<p>> It’s a bit more complicated than this because users can selectively delete files from those diffs<p>I infer that they might be logging changes under the hood in a Git-like fashion.<p>However, it seems that the raw versioning data they manage is not directly available to clients or users:<p>> a simple request that we frequently get from our customers: <i>“can we export our datasets from Palantir Foundry to our existing data lake or S3 bucket?“</i> While this is of course possible, it is important to understand that such exported datasets lack precisely those versioning and sandboxing features that make Foundry a great tool for collaborative data engineering.<p>This could act as a mechanism for vendor lock-in, since the versioning is tied to the very important ACID guarantees of their implementation.<p>I came across their post while researching existing solutions for dataset versioning. Some extra background here: <a href="https://news.ycombinator.com/item?id=35930895" rel="nofollow">https://news.ycombinator.com/item?id=35930895</a>
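<p>To make my inference about diff folders plus logical deletes concrete, here is a minimal sketch of how such a scheme could work. This is purely my guess and not Foundry's actual implementation: the names DatasetLog, Transaction, append, delete and view are all hypothetical, and numbering folders by transaction is my own assumption. Each append writes files into a new diff folder, each delete only records a tombstone, and the visible dataset is resolved by replaying the log.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Transaction:
    """One committed change (hypothetical): files added and files logically deleted."""
    added: tuple[str, ...] = ()
    deleted: tuple[str, ...] = ()


@dataclass
class DatasetLog:
    """Append-only log of transactions for one dataset (an assumed design, not Foundry's)."""
    name: str
    transactions: list[Transaction] = field(default_factory=list)

    def append(self, files: list[str]) -> int:
        """Record new files under a fresh diff folder and return the new version number."""
        diff_dir = f"{self.name}/diff{len(self.transactions) + 1}"
        paths = tuple(f"{diff_dir}/{f}" for f in files)
        self.transactions.append(Transaction(added=paths))
        return len(self.transactions)

    def delete(self, paths: list[str]) -> int:
        """Record a logical delete; nothing is removed from storage."""
        self.transactions.append(Transaction(deleted=tuple(paths)))
        return len(self.transactions)

    def view(self, version: int | None = None) -> list[str]:
        """Replay the log up to `version` (default: latest) and return the visible files."""
        upto = len(self.transactions) if version is None else version
        visible: set[str] = set()
        for tx in self.transactions[:upto]:
            visible.update(tx.added)
            visible.difference_update(tx.deleted)
        return sorted(visible)


if __name__ == "__main__":
    log = DatasetLog("datasetA")
    v1 = log.append(["part-0.csv", "part-1.csv"])
    v2 = log.append(["part-2.csv"])
    v3 = log.delete(["datasetA/diff1/part-1.csv"])
    print(log.view())    # latest view, with the deleted file hidden
    print(log.view(v2))  # earlier version, where the file is still visible
```

<p>Under these assumptions, a plain listing of datasetA/* would still include the deleted file, which would explain why an exported copy loses the versioning semantics: the log, not the folder layout, defines what the dataset is.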