I've used DVC for most of my projects for the past five years. The good thing is that it works a lot like git. If your scientists understand branches, commits, and diffs, they should be able to understand DVC. The bad thing is that it works like git. Scientists often do not, in fact, understand or use branches, commits, and diffs. The best thing is that it essentially forces you to follow Ten Simple Rules for Reproducible Computational Research [1]. Reproducibility has been a huge challenge on teams I've worked on.<p>[1] <a href="https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003285" rel="nofollow">https://journals.plos.org/ploscompbiol/article?id=10.1371/jo...</a>
Hi there! Maintainer and author here. Excited to see DVC on the front page!<p>Happy to answer any questions about DVC and our sister project DataChain <a href="https://github.com/iterative/datachain">https://github.com/iterative/datachain</a>, which does data versioning with slightly different assumptions: no file copies and built-in data transformations.
Great to see DVC being discussed here! As a tool, it’s done a lot to simplify version control for data and models, and it’s been a game-changer for
many in the MLOps space.<p>Specifically, it's a genius way to store large files in git repos directly on any object storage, without custom application servers like git-lfs or rewriting git from scratch...<p>At DagsHub [0], we've integrated directly with DVC for a looong time, so teams can use it with added features like visualizing and labeling datasets, managing models, running experiments collaboratively, and tracking everything (code, data, models, etc.) all in one place.<p>Just wanted to share that for those already using or considering DVC: there are some options to use it as a building block in a more end-to-end toolchain.<p>[0] <a href="https://dagshub.com" rel="nofollow">https://dagshub.com</a>
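For anyone curious how the "large files in git, blobs on object storage" trick works in principle: the file goes into a content-addressed cache keyed by its hash, and only a tiny pointer file gets committed to git. A minimal sketch in Python (function and file names are illustrative, not DVC's actual API):

```python
import hashlib
import json
import shutil
from pathlib import Path

def add_to_cache(path: Path, cache_dir: Path) -> Path:
    """Store a file in a content-addressed cache and return a pointer file.

    The pointer (analogous to a .dvc file) is tiny and safe to commit to
    git; the blob itself lives in the cache, which can be synced to any
    object store without a custom server.
    """
    md5 = hashlib.md5(path.read_bytes()).hexdigest()
    # Shard by the first two hex chars, as git and DVC both do,
    # to avoid huge flat directories.
    blob = cache_dir / md5[:2] / md5[2:]
    blob.parent.mkdir(parents=True, exist_ok=True)
    if not blob.exists():
        shutil.copy2(path, blob)
    pointer = path.with_suffix(path.suffix + ".ptr")
    pointer.write_text(json.dumps({"md5": md5, "path": path.name}))
    return pointer
```

Because the cache path is derived from content, identical files dedupe for free, and "pushing" is just copying the sharded cache directory to S3/GCS/ADLS.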
It's not super clear to me how this interacts with data. If I'm using ADLS to store Delta tables, and I can't pull prod data down to my local machine, can I still use this? Is there a point if I can just use the Delta log to switch between past versions?
We actually were considering DVC, but for our particular use case (huge video files that don't change much) the git paradigm was not that useful: you need at least one copy of the data on the origin and another on each system that's doing the training. So in the end we just went with files and folders on a NAS, which seemed to work well enough.<p>A hybrid solution, keeping only the dataset metadata under DVC and versioning that, could work. This was many years ago though, and I'd be curious if there are any other on-prem data versioning solutions; when I last searched, all of them seemed geared towards the cloud.
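The hybrid idea can be as simple as committing a small manifest of the NAS contents to git while the videos stay put. A rough sketch of what that manifest builder might look like (paths and layout are assumptions, not any particular tool's format):

```python
import hashlib
import json
from pathlib import Path

def build_manifest(data_root: Path) -> str:
    """Walk a dataset directory (e.g. an NFS mount) and emit a JSON
    manifest of relative path, size, and md5 for every file.

    The manifest is small enough to commit to git (or track with DVC),
    giving you versioned, diffable snapshots of the dataset without
    ever copying the large files themselves.
    """
    entries = []
    for p in sorted(data_root.rglob("*")):
        if p.is_file():
            entries.append({
                "path": str(p.relative_to(data_root)),
                "size": p.stat().st_size,
                "md5": hashlib.md5(p.read_bytes()).hexdigest(),
            })
    return json.dumps(entries, indent=2)
```

Diffing two committed manifests then tells you exactly which videos were added, removed, or changed between training runs.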
I had a lot of problems when using it with a dataset of many JPEG files.<p>Every <code>dvc status</code> re-indexed the dataset and took many minutes to check each file; caching did not work.<p>Sadly, I had to let go of it.
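For context on why that hurts: a naive status check re-hashes every file, which is slow for thousands of small JPEGs. The usual fix (and roughly the idea behind DVC's state database; this sketch is not its actual implementation) is to skip hashing any file whose (mtime, size) signature hasn't changed:

```python
import hashlib
import json
from pathlib import Path

def changed_files(data_root: Path, state_file: Path) -> list:
    """Report files whose content changed since the last run, hashing
    only files whose (mtime, size) signature differs from saved state.

    When this kind of cache misbehaves, every status call degrades to
    a full re-hash of the dataset, which is the slowness described above.
    """
    old = json.loads(state_file.read_text()) if state_file.exists() else {}
    new, changed = {}, []
    for p in sorted(data_root.rglob("*.jpg")):
        key = str(p.relative_to(data_root))
        st = p.stat()
        sig = [st.st_mtime, st.st_size]
        if old.get(key, {}).get("sig") == sig:
            new[key] = old[key]  # signature unchanged: skip hashing entirely
            continue
        md5 = hashlib.md5(p.read_bytes()).hexdigest()
        new[key] = {"sig": sig, "md5": md5}
        if old.get(key, {}).get("md5") != md5:
            changed.append(key)
    state_file.write_text(json.dumps(new))
    return changed
```

The first run hashes everything; subsequent runs should only hash files that were actually touched, turning minutes into seconds.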