Greetings and thank you for reading!

I'm involved in several projects where we process very large amounts of data and generate secondary data, which we then organize on disk for downstream analysis. These are very large files and datasets (individual files range from a few GB to several hundred GB). Each dataset fits into a single directory, and each directory would be documented as a whole.

I would like to put a process in place that ensures the data is well documented, for both humans and machines. Right now we are working on the low-hanging fruit:

1. Consistent directory structure(s)
2. Generating file hashes
3. Human-readable docs/metadata (Markdown), including "version", dates, data provenance, protocols, etc.
4. Machine-readable docs (JSON) carrying the same information as above
5. Simple file manifest (file path, hash)

It would be great to get any kind of feedback and suggestions, as well as pitfalls to avoid; for concreteness, I've pasted a rough sketch of the hashing/manifest step at the bottom of this post. Cheers!

(First post here, hopefully I'm doing this correctly)
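To make items 2, 4, and 5 concrete, here is the kind of throwaway script I have in mind: it streams every file through SHA-256, writes a simple manifest, and emits a minimal JSON metadata record. The file names (manifest.tsv, dataset.json) and the metadata fields are just placeholders I made up for illustration, not anything we've settled on.

```python
#!/usr/bin/env python3
"""Sketch: hash every file in a dataset directory, write a manifest,
and emit a minimal machine-readable metadata record.
Output names and metadata fields are placeholders."""

import hashlib
import json
import sys
from datetime import datetime, timezone
from pathlib import Path

CHUNK = 1024 * 1024  # read in 1 MiB chunks so multi-GB files don't exhaust RAM
OUTPUT_NAMES = {"manifest.tsv", "dataset.json"}  # don't hash our own outputs


def sha256_of(path: Path) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(CHUNK), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(dataset_dir: Path) -> list:
    """Collect (relative path, size, hash) for every regular file in the dataset."""
    records = []
    for p in sorted(dataset_dir.rglob("*")):
        if p.is_file() and p.name not in OUTPUT_NAMES:
            records.append({
                "path": str(p.relative_to(dataset_dir)),
                "size_bytes": p.stat().st_size,
                "sha256": sha256_of(p),
            })
    return records


def main(dataset_dir: Path) -> None:
    manifest = build_manifest(dataset_dir)

    # Simple tab-separated manifest: hash<TAB>relative path
    with (dataset_dir / "manifest.tsv").open("w") as f:
        for rec in manifest:
            f.write(f"{rec['sha256']}\t{rec['path']}\n")

    # Minimal machine-readable metadata; fields here are placeholders.
    metadata = {
        "dataset_name": dataset_dir.name,
        "version": "0.1.0",
        "created_utc": datetime.now(timezone.utc).isoformat(),
        "provenance": "TODO: upstream inputs, pipeline version, protocol reference",
        "files": manifest,
    }
    (dataset_dir / "dataset.json").write_text(json.dumps(metadata, indent=2))


if __name__ == "__main__":
    main(Path(sys.argv[1]))
```

The idea is that the same records back both the human-readable manifest and the JSON metadata, so the two can't drift apart; whether that's the right split is exactly the kind of thing I'd love feedback on.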