I've used DVC for most of my projects for the past five years. The good thing is that it works a lot like git. If your scientists understand branches, commits, and diffs, they should be able to understand DVC. The bad thing is that it works like git. Scientists often do not, in fact, understand or use branches, commits, and diffs. The best thing is that it essentially forces you to follow Ten Simple Rules for Reproducible Computational Research [1]. Reproducibility has been a huge challenge on teams I've worked on.<p>[1] <a href="https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003285" rel="nofollow">https://journals.plos.org/ploscompbiol/article?id=10.1371/jo...</a>
Hi there! Maintainer and author here. Excited to see DVC on the front page!<p>Happy to answer any questions about DVC and our sister project DataChain <a href="https://github.com/iterative/datachain">https://github.com/iterative/datachain</a>, which does data versioning with slightly different assumptions: no file copies and built-in data transformations.
Great to see DVC being discussed here! As a tool, it’s done a lot to simplify version control for data and models, and it’s been a game-changer for
many in the MLOps space.<p>Specifically, it's a genius way to store large files in git repos directly on any object storage, without custom application servers like git-lfs or rewriting git from scratch...<p>At DagsHub [0], we've integrated directly with DVC for a looong time, so teams can use it with added features like visualizing and labeling datasets, managing models, running experiments collaboratively, and tracking everything (code, data, models, etc.) all in one place.<p>Just wanted to share that for those already using or considering DVC: there are some options to use it as a building block in a more end-to-end toolchain.<p>[0] <a href="https://dagshub.com" rel="nofollow">https://dagshub.com</a>
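For anyone curious how the "large files in git, blobs on object storage" trick works in principle: the file goes into a content-addressed cache keyed by its hash, and only a tiny pointer file gets committed to git. A minimal sketch in Python (function and file names are illustrative, not DVC's actual API):

```python
import hashlib
import json
import shutil
from pathlib import Path

def add_to_cache(path: Path, cache_dir: Path) -> Path:
    """Store a file in a content-addressed cache and return a pointer file.

    The pointer (analogous to a .dvc file) is tiny and safe to commit to
    git; the blob itself lives in the cache, which can be synced to any
    object store without a custom server.
    """
    md5 = hashlib.md5(path.read_bytes()).hexdigest()
    # Shard by the first two hex chars, as git and DVC both do,
    # to avoid huge flat directories.
    blob = cache_dir / md5[:2] / md5[2:]
    blob.parent.mkdir(parents=True, exist_ok=True)
    if not blob.exists():
        shutil.copy2(path, blob)
    pointer = path.with_suffix(path.suffix + ".ptr")
    pointer.write_text(json.dumps({"md5": md5, "path": path.name}))
    return pointer
```

Because the cache path is derived from content, identical files dedupe for free, and "pushing" is just copying the sharded cache directory to S3/GCS/ADLS.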
It's not super clear to me how this interacts with data. If I'm using ADLS to store Delta tables, and I can't pull prod data down to my local machine, can I still use this? Is there a point if I can just use the Delta log to switch between past versions?
We actually were considering DVC, but for our particular use case (huge video files that don't change much) the git paradigm was not that useful: you need at least one copy of the data on the origin and another on each system that's doing the training. So in the end we just went with files and folders on a NAS, which seemed to work well enough.<p>A hybrid solution, keeping only the dataset metadata under DVC and versioning that, could work. This was many years ago though, and I'd be curious if there are any other on-prem data versioning solutions; when I last searched, all of them seemed geared towards the cloud.
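The hybrid idea can be as simple as committing a small manifest of the NAS contents to git while the videos stay put. A rough sketch of what that manifest builder might look like (paths and layout are assumptions, not any particular tool's format):

```python
import hashlib
import json
from pathlib import Path

def build_manifest(data_root: Path) -> str:
    """Walk a dataset directory (e.g. an NFS mount) and emit a JSON
    manifest of relative path, size, and md5 for every file.

    The manifest is small enough to commit to git (or track with DVC),
    giving you versioned, diffable snapshots of the dataset without
    ever copying the large files themselves.
    """
    entries = []
    for p in sorted(data_root.rglob("*")):
        if p.is_file():
            entries.append({
                "path": str(p.relative_to(data_root)),
                "size": p.stat().st_size,
                "md5": hashlib.md5(p.read_bytes()).hexdigest(),
            })
    return json.dumps(entries, indent=2)
```

Diffing two committed manifests then tells you exactly which videos were added, removed, or changed between training runs.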
I had a lot of problems when using it with a dataset of many JPEG files.<p>Every <code>dvc status</code> re-indexed the dataset and took many minutes to check each file; caching did not work.<p>Sadly, I had to let go of it.
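For context on why that hurts: a naive status check re-hashes every file, which is slow for thousands of small JPEGs. The usual fix (and roughly the idea behind DVC's state database; this sketch is not its actual implementation) is to skip hashing any file whose (mtime, size) signature hasn't changed:

```python
import hashlib
import json
from pathlib import Path

def changed_files(data_root: Path, state_file: Path) -> list:
    """Report files whose content changed since the last run, hashing
    only files whose (mtime, size) signature differs from saved state.

    When this kind of cache misbehaves, every status call degrades to
    a full re-hash of the dataset, which is the slowness described above.
    """
    old = json.loads(state_file.read_text()) if state_file.exists() else {}
    new, changed = {}, []
    for p in sorted(data_root.rglob("*.jpg")):
        key = str(p.relative_to(data_root))
        st = p.stat()
        sig = [st.st_mtime, st.st_size]
        if old.get(key, {}).get("sig") == sig:
            new[key] = old[key]  # signature unchanged: skip hashing entirely
            continue
        md5 = hashlib.md5(p.read_bytes()).hexdigest()
        new[key] = {"sig": sig, "md5": md5}
        if old.get(key, {}).get("md5") != md5:
            changed.append(key)
    state_file.write_text(json.dumps(new))
    return changed
```

The first run hashes everything; subsequent runs should only hash files that were actually touched, turning minutes into seconds.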