Show HN: Git for datasets and config table versioning, that commits diffs

1 point by eliomattia about 2 years ago
I am building the Data Manager to version datasets and configuration tables in a storage-efficient way, and to easily identify dataset versions and deploy them to S3 to feed other code. It works on top of git for versioning but calculates and commits only incremental differences, locally and in the cloud. Committing diffs can enable collaboration on huge repositories without full checkouts for certain use cases, using only a logical checkout of a few kilobytes and letting other machines merge your contributions into branches.

Running

  D:\install\dir\dm>dm

will:

* make sure the Data Manager is in sync with the Git HEAD
* process the data pipelines configured in data-manager-config.json in the installation folder
* for each source dataset, calculate the diffs against the state represented by the HEAD
* commit those diffs in a readable format that the Data Manager can also parse
* build snapshots and post them to S3, if configured

The installation .msi comes with sample \datasets, and running dm.exe will automatically create sample \repos. Supported data sources: CSV, xlsx. You can create snapshots by tagging commits (there is a customizable "api_" tag prefix filter by default). Snapshots, identified by tag_name:commit_sha, can be posted to S3. Heavy files beyond a custom threshold will also be posted to S3, if configured, and referenced indirectly in the repo. The current best use case is multiple datasets of a couple of gigabytes each with daily changes.

You need to have git installed and available in PATH (check with git --version), and you need to grant permissions in your antivirus and flag the executable (dm.exe) as trusted. Current usage constraints: data must be structured and tabular, no dataset primary key changes are allowed (there's a workaround), and merge features are a work in progress. This early prototype will replay history using a naive algorithm.

Related post: https://news.ycombinator.com/item?id=35806843

AWS configuration goes in C:\Users\<username>\.aws\ as two files, (1) config and (2) credentials, with no file extension.

(1) config content:

  [default]
  region=us-west-1

(2) credentials content:

  [default]
  aws_access_key_id=AKIA...
  aws_secret_access_key=wJalrXU...

The S3 bucket name goes in data-manager-config.json; the bucket should be available in the configured region and accessible using the provided access key.

  { "s3": { "default-bucket": "mybucket" ... } }
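
To make the idea of committing diffs instead of full datasets more concrete, here is a minimal Python sketch of a primary-key-based row diff between two CSV states, plus a replay step. It is not the Data Manager's actual algorithm or commit format, which the post does not describe. The function names (load_rows, diff_rows, apply_diff), the sample data, and the added/removed/changed layout are all assumptions made for illustration.

```python
import csv
import io


def load_rows(csv_text, key):
    """Parse CSV text into a dict mapping primary-key value -> row dict."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[key]: row for row in reader}


def diff_rows(old, new):
    """Return rows added, removed, and changed between two keyed states."""
    added = {k: new[k] for k in new.keys() - old.keys()}
    removed = {k: old[k] for k in old.keys() - new.keys()}
    changed = {
        k: {"old": old[k], "new": new[k]}
        for k in old.keys() & new.keys()
        if old[k] != new[k]
    }
    return {"added": added, "removed": removed, "changed": changed}


def apply_diff(old, diff):
    """Replay one diff on top of a state to rebuild the next state."""
    state = {k: v for k, v in old.items() if k not in diff["removed"]}
    state.update(diff["added"])
    state.update({k: v["new"] for k, v in diff["changed"].items()})
    return state


# Hypothetical sample data: the state at HEAD and the current source file.
HEAD_CSV = "id,name,price\n1,apple,1.00\n2,pear,2.50\n3,plum,0.75\n"
WORK_CSV = "id,name,price\n1,apple,1.10\n3,plum,0.75\n4,kiwi,3.00\n"

head = load_rows(HEAD_CSV, key="id")
work = load_rows(WORK_CSV, key="id")

# Only this delta would need to be stored, not the whole new dataset.
delta = diff_rows(head, work)
rebuilt = apply_diff(head, delta)
```

Here delta holds one added row (id 4), one removed row (id 2), and one changed row (id 1), while the untouched row (id 3) is not stored at all; apply_diff(head, delta) reproduces the working state. In a scheme like this one, rows are matched on a stable key, so a changed key would show up as a removal plus an addition, which may be one reason primary key changes are restricted.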

No comments yet.