DVC had the following problems when I tested it (half a year ago):<p>It gets super slow (waiting minutes) when a few thousand files are tracked. And thousands of files do have to be tracked if you have, e.g., one 10GB file per day and region, plus the artifacts generated from each.<p>You are encouraged to model your pipeline in DVC (think make); otherwise it can only track artifacts. However, it cannot run tasks in parallel. So running a pipeline takes a lot of time even on a beefy machine, because only one core is used. And since DVC owns the pipeline execution, you cannot swap in other tools (e.g. Snakemake) to distribute or parallelize across multiple machines. Running one (part of a) stage also has some overhead, because DVC does a checkout before and a commit after running the task's executable.<p>Sometimes you get merge conflicts if you manually run one part of a (partially parametrized) stage on one machine and the other part on another machine. These are cumbersome to fix.<p>Currently, I think they are focused more on ML features like experiment tracking (I prefer other, more mature tools here) than on performance and data safety.<p>There is an alternative implementation from a single developer (I cannot find it right now) that fixes some of these problems. However, I do not use it because it probably will not see the same development progress and testing as DVC.<p>This sounds negative, but I think it is currently one of the best tools in this space.
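For readers unfamiliar with the make-style pipelines mentioned above: DVC describes stages declaratively in a dvc.yaml file, and each stage's command runs sequentially. A minimal sketch (the stage names, scripts, and file paths here are hypothetical):

```yaml
# dvc.yaml — illustrative two-stage pipeline; `dvc repro` runs stages
# in dependency order, one command at a time (no parallelism).
stages:
  preprocess:
    cmd: python preprocess.py data/raw.csv data/clean.csv
    deps:
      - preprocess.py
      - data/raw.csv
    outs:
      - data/clean.csv
  train:
    cmd: python train.py data/clean.csv model.pkl
    deps:
      - train.py
      - data/clean.csv
    outs:
      - model.pkl
```

The checkout/commit overhead mentioned above happens around each `cmd`, since DVC hashes the listed deps and outs before and after the command runs.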
If you just want a git for large data files, and your files don't get updated too often (e.g. an ML model deployed in production that gets updated every month), then git-lfs is a nice solution. Both Bitbucket and GitHub support it.
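A minimal sketch of the git-lfs setup, assuming a hypothetical `*.model` file pattern:

```
# Run once per repo:
#   git lfs install
#   git lfs track "*.model"
# The track command writes this line to .gitattributes:
*.model filter=lfs diff=lfs merge=lfs -text
```

After committing the .gitattributes file, matching files are stored as small pointer files in git while the actual content lives in the LFS store.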
Can anyone compare this to DataLad [1], which someone introduced to me as "git for data"?<p>[1] <a href="https://www.datalad.org/" rel="nofollow">https://www.datalad.org/</a>
If you're looking for something that actually tracks tabular data, there's <a href="https://kartproject.org" rel="nofollow">https://kartproject.org</a>. It's geo-focused but also works with standard database tables. It's built on git (Kart repos are git repos) and can track PostgreSQL, MSSQL, MySQL, etc.
I don't think this tool can encompass everything you need for managing ML models and data sets, even if you limit it to versioning data.<p>I'd need such a tool to manage features, checkpoints, and labels. This doesn't do any of that. Nor does it really handle merging multiple versions of data.<p>And I'd really like the code to be handled separately from the data. Git is not the place to do this: the choice of pairing code with data should happen at a higher level and be tracked along with the results. That doesn't belong in a repo; MLflow or TensorBoard handles it better.