DVC had the following problems when I tested it (half a year ago):<p>It gets super slow (waiting minutes) when a few thousand files are tracked. And thousands of files do have to be tracked if you have, e.g., one 10GB file per day and region, plus the artifacts generated from each.<p>You are encouraged to model your pipeline in DVC (think make); otherwise it can only track artifacts. However, it cannot run tasks in parallel. So running a pipeline takes a lot of time even on a beefy machine, because only one core is used. And since DVC owns the pipeline execution, you cannot swap in other tools (e.g. Snakemake) to distribute or parallelize across multiple machines. Running one (part of a) stage also has some overhead, because DVC does a checkout before and a commit after running the task's executable.<p>Sometimes you get merge conflicts if you manually run one part of a (partially parametrized) stage on one machine and the other part on another machine. These are cumbersome to fix.<p>Currently, I think they are focused more on ML features like experiment tracking (I prefer other, more mature tools here) than on performance and data safety.<p>There is an alternative implementation from a single developer (I cannot find it right now) that fixes some of these problems. However, I do not use it because it probably will not see the same development progress and testing as DVC.<p>This sounds negative, but I think it is currently one of the best tools in this space.
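For readers unfamiliar with the make-style pipelines mentioned above: DVC describes stages declaratively in a dvc.yaml file, and each stage's command runs sequentially. A minimal sketch (the stage names, scripts, and file paths here are hypothetical):

```yaml
# dvc.yaml — illustrative two-stage pipeline; `dvc repro` runs stages
# in dependency order, one command at a time (no parallelism).
stages:
  preprocess:
    cmd: python preprocess.py data/raw.csv data/clean.csv
    deps:
      - preprocess.py
      - data/raw.csv
    outs:
      - data/clean.csv
  train:
    cmd: python train.py data/clean.csv model.pkl
    deps:
      - train.py
      - data/clean.csv
    outs:
      - model.pkl
```

The checkout/commit overhead mentioned above happens around each `cmd`, since DVC hashes the listed deps and outs before and after the command runs.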
If you just want a git for large data files, and your files don't get updated too often (e.g. an ML model deployed in production that gets updated every month), then git-lfs is a nice solution. Both Bitbucket and GitHub support it.
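A minimal sketch of the git-lfs setup, assuming a hypothetical `*.model` file pattern:

```
# Run once per repo:
#   git lfs install
#   git lfs track "*.model"
# The track command writes this line to .gitattributes:
*.model filter=lfs diff=lfs merge=lfs -text
```

After committing the .gitattributes file, matching files are stored as small pointer files in git while the actual content lives in the LFS store.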
Can anyone compare this to DataLad [1], which someone introduced to me as "git for data"?<p>[1] <a href="https://www.datalad.org/" rel="nofollow">https://www.datalad.org/</a>
If you're looking for something that actually tracks tabular data, there's <a href="https://kartproject.org" rel="nofollow">https://kartproject.org</a>. It's geo-focused but also works with standard database tables. It's built on git (Kart repos are git repos) and can track PostgreSQL, MSSQL, MySQL, etc.
I don't think this tool can encompass everything you need for managing ML models and data sets, even if you limit it to versioning data.<p>I'd need such a tool to manage features, checkpoints, and labels. This doesn't do any of that. Nor does it really handle merging multiple versions of data.<p>And I'd really like the code to be handled separately from the data. Git is not the place to do this: the choice of pairing code with data should happen at a higher level and be tracked along with the results. That doesn't belong in a repo; MLflow or TensorBoard handles it better.