Hi HN! We are the DVC team. We are releasing a new tool to make it simpler for ML teams to curate their unstructured data and improve the quality of their datasets.

ML teams have a lot of files: text, images, video, PDFs, etc. Those objects have information (metadata) attached to them (e.g. labels, embeddings, captions). Over the last 5–6 years of building DVC, we’ve observed a need for data versioning at scale, but also a very strong need to store and enrich this metadata, and to slice and dice files based on it (i.e. create datasets). We’ve seen teams build and rebuild the same infrastructure, glue ETLs, and scripts again and again, and decided it would be better to solve this in a more systematic way using our knowledge and experience.

DataChain is a Python library. Think of DataChain as a “data frame” with a “chain” of operations that can be applied to it: filter, merge, map, etc. We don’t store files (or require moving or converting them); instead, DataChain stores references to the originals (paths + version IDs). Underneath, it uses a database (SQLite) to persist results (datasets) and to do out-of-memory computation. It supports parallel computation, data caching, and many other things that make it better suited for unstructured data, ML, and larger scale. (There’s a short sketch of the API at the end of this post.)

Saved datasets (data chains) can be passed to a data loader (e.g. PyTorch) to access the original raw files plus the metadata from the DB.

Please let us know your thoughts and questions in the comments!
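
P.S. To make the “chain” idea concrete, here is a minimal sketch of building a chain and feeding it to PyTorch. The bucket path, the file_ext helper, and the column names are made up for illustration, so please check the docs for the exact API:

    from torch.utils.data import DataLoader

    from datachain import Column, DataChain, File

    def file_ext(file: File) -> str:
        # Derive a new metadata column ("signal") from each file reference.
        return file.path.rsplit(".", 1)[-1]

    chain = (
        DataChain.from_storage("s3://my-bucket/docs/")  # stores references only, no copies
        .map(ext=file_ext)                              # enrich the metadata
        .filter(Column("ext") == "pdf")                 # slice and dice by metadata
        .save("pdf-docs")                               # persist as a named dataset
    )

    # The saved chain can feed a standard PyTorch DataLoader:
    loader = DataLoader(chain.to_pytorch(), batch_size=16)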