Hello HN!<p>After letting it linger in a private repo for a while, I decided to open source Snapdir, a reasonably documented and tested set of bash scripts for creating, sharing, and verifying snapshots of directories and their contents using human-readable manifests.<p>While I created Snapdir to solve a specific problem around ML pipeline reproducibility and data distribution, I've also found it very useful for vendoring dependencies such as VM AMIs, binaries, and container images in my infrastructure-as-code environments.<p>The manifest format is meant to be easy to understand, and manifests can be kept under version control for auditing purposes. If you want to incorporate it into your workflows, you can use the <i>snapdir-manifest</i> script in isolation.<p>The project is still in its early stages, and the current bash version should be considered a proof of concept. You can use the tests included in the bash version to validate implementations in other languages.<p>For a quick test, you can play with it from the 5 MB Docker image:<p><pre><code> docker run -it --rm bermi/snapdir help
</code></pre>
and to check a sample manifest:<p><pre><code> docker run -it --rm bermi/snapdir manifest /lib
</code></pre>
I hope you find it helpful, and I would love to get your feedback and PRs!
This looks really useful for managing datasets in fields like finance or AI, where we need to be able to show data provenance for regulatory or reproducibility reasons. Since it's implemented in Bash, I'd be curious about performance on larger datasets. I wonder if there would be any scalability improvements in moving to something like C or Carbon?