Palantir Foundry's dataset version control, a diff-based Git for data

3 points by eliomattia about 2 years ago

1 comment

eliomattia about 2 years ago
I just found this blog post. It seems Palantir Foundry, which does not come up often when researching Git-for-data tools, includes a version control system for datasets that stores diffs in its own cloud-based filesystem. According to the author, one of the founding engineers of the platform, diffs are:

> particularly useful for append-only datasets of immutable records such as system logs or sensor readings which are often among the largest (and fastest-growing) datasets our customers use

Diffs seem to consist of additional files in separate folders:

> behind the scenes we effectively store each diff in a separate folder in the backing file system (e.g., datasetA/diff1, datasetA/diff2, …) so that the whole dataset is simply represented by datasetA/*.

Without exposing technicalities, the author suggests that deletes are handled logically rather than physically, since datasetA/* may not reflect the actual whole dataset. I infer that they might be logging changes under the hood in a Git-like fashion.

> It’s a bit more complicated than this because users can selectively delete files from those diffs

However, it seems that the raw versioning data they manage is not available to clients or users directly:

> a simple request that we frequently get from our customers: *“can we export our datasets from Palantir Foundry to our existing data lake or S3 bucket?”* While this is of course possible, it is important to understand that such exported datasets lack precisely those versioning and sandboxing features that make Foundry a great tool for collaborative data engineering.

This could be a mechanism for vendor lock-in, tied to the very important ACID guarantees of their implementation.

I came across their post while doing research on existing solutions for dataset versioning. Some extra background here: https://news.ycombinator.com/item?id=35930895
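
To make the layout in the quotes above concrete, here is a minimal Python sketch of how I picture it. This is purely my own toy model, not Foundry's API or actual implementation: the names `ToyDataset`, `Commit`, `append`, `delete`, and `view` are all hypothetical, and treating a delete as its own commit is my assumption based on the "Git-like" inference above. Each append lands in its own `datasetA/diffN` folder, and a delete only records that a path should be hidden, so nothing is physically removed.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    """One entry in the change log (hypothetical structure)."""
    added: dict[str, list[str]]   # file path -> records written by this diff
    removed: frozenset[str]       # file paths logically deleted by this commit


class ToyDataset:
    """Toy diff-based dataset: a name plus an append-only log of commits."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.commits: list[Commit] = []

    def append(self, files: dict[str, list[str]]) -> int:
        """Store new files under their own diff folder; return the new version."""
        folder = f"{self.name}/diff{len(self.commits) + 1}"
        added = {f"{folder}/{fname}": list(rows) for fname, rows in files.items()}
        self.commits.append(Commit(added, frozenset()))
        return len(self.commits)

    def delete(self, *paths: str) -> int:
        """Logically delete files: log the removal, keep the stored data."""
        self.commits.append(Commit({}, frozenset(paths)))
        return len(self.commits)

    def view(self, version: int | None = None) -> dict[str, list[str]]:
        """Replay the log up to `version` and return the visible files."""
        upto = len(self.commits) if version is None else version
        visible: dict[str, list[str]] = {}
        for commit in self.commits[:upto]:
            visible.update(commit.added)
            for path in commit.removed:
                visible.pop(path, None)
        return visible


ds = ToyDataset("datasetA")
ds.append({"part-0": ["log line 1", "log line 2"]})  # version 1
ds.append({"part-0": ["log line 3"]})                # version 2
ds.delete("datasetA/diff1/part-0")                   # version 3
print(sorted(ds.view()))    # latest view hides the deleted file
print(sorted(ds.view(2)))   # version 2 still shows both files
```

In this sketch, reading `datasetA/*` straight from storage would still return the deleted file, which is why a plain export to a data lake or S3 bucket would lose the versioned view: the change log, not the folder listing, decides what the dataset contains at a given version.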