Are there best practices for unstructured data versioning and metadata?

1 point by deadflat over 1 year ago
Greetings and thank you for reading!

I'm involved in several projects where we process very large amounts of data and generate secondary data, which we then organize on disk for downstream analysis. These are very large files and datasets (files range from a few GB to several hundred GB). Each dataset can be organized into one directory, and each of these would be documented as a whole.

I would like to implement a process where we can ensure that the data is well documented (for humans and machines). Right now, we are working on the low-hanging fruit:

1. Consistent directory structure(s)
2. Generating file hashes
3. Docs/metadata (human: markdown, machine: JSON), including "version", dates, data provenance, protocols, etc.
4. Machine-readable docs (JSON), like above
5. Simple file manifest (file, hash); see the sketch below

It would be great to get any kind of feedback, suggestions, as well as pitfalls to avoid. Cheers!

(First post here, hopefully I'm doing this correctly.)
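To make steps 2 and 5 concrete, here is a minimal sketch using only the Python standard library. The dataset path, the "version" field, and the manifest filename are placeholders, not an established standard; it streams hashes in chunks so multi-GB files never need to fit in memory:

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream-hash a file in 1 MiB chunks so large files stay out of memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(dataset_dir: Path) -> dict:
    """Walk one dataset directory and record (relative path, hash) pairs."""
    files = {
        str(p.relative_to(dataset_dir)): sha256_of(p)
        for p in sorted(dataset_dir.rglob("*"))
        if p.is_file() and p.name != "manifest.json"
    }
    return {"version": "1.0", "files": files}  # placeholder version field


if __name__ == "__main__":
    dataset = Path("datasets/run-2024-01")  # placeholder dataset directory
    manifest = build_manifest(dataset)
    (dataset / "manifest.json").write_text(json.dumps(manifest, indent=2))
```

The same walk could emit the machine-readable doc from steps 3 and 4 by merging in fields like dates and provenance, and keeping paths relative and sorted makes manifests diffable across dataset versions.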

no comments