ML Experiments Management with Git

112 points by shcheklein over 1 year ago

7 comments

krastanov over 1 year ago
Another option that manages versioning of your computational graph and its results, and provides extremely elegant query-able memoization, is Mandala: https://github.com/amakelov/mandala

It is a much simpler and much more magical piece of software that truly expanded how I think about writing, exploring, and experimenting with code. Even if you never use it, you would probably really enjoy reading the blog posts the author wrote about the design of the tool: https://amakelov.github.io/blog/pl/
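To make "query-able memoization" concrete, here is a minimal sketch in plain Python. It is not Mandala's API; `STORE`, `memoize`, `query`, and `evaluate` are hypothetical names made up for illustration, and the store is an in-memory dict standing in for persistent storage.

```python
import functools
import hashlib
import json

# Hypothetical in-memory stand-in for a persistent store: call key -> record.
STORE = {}
# Counts how often the wrapped function body really runs.
CALLS = {"count": 0}


def memoize(func):
    """Cache results keyed by a hash of the function name and its arguments."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        payload = json.dumps(
            {"func": func.__name__, "args": args, "kwargs": kwargs},
            sort_keys=True,
        )
        key = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        if key not in STORE:
            # First time we see these arguments: compute and record the call.
            STORE[key] = {
                "func": func.__name__,
                "kwargs": kwargs,
                "result": func(*args, **kwargs),
            }
        return STORE[key]["result"]
    return wrapper


def query(func_name, **kwargs_filter):
    """Return stored results of func_name whose keyword arguments match."""
    return [
        record["result"]
        for record in STORE.values()
        if record["func"] == func_name
        and all(record["kwargs"].get(k) == v for k, v in kwargs_filter.items())
    ]


@memoize
def evaluate(depth, width):
    """Toy stand-in for an expensive experiment."""
    CALLS["count"] += 1
    return depth * width
```

Calling `evaluate(depth=3, width=4)` twice runs the body once, and `query("evaluate", width=4)` then returns every stored result computed with that width. A real tool would also have to track code changes and persist the store, which this sketch leaves out.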
shcheklein over 1 year ago
One of the maintainers here. I published this link, tbh, specifically to emphasize the experiment management aspect of DVC. Historically, because of its name (Data Version Control), users perceived it as a pure replacement for LFS scenarios, while in reality it has always had pipelines, metrics, etc.

I 100% agree that managing large datasets by moving them around is not practical, and definitely not in LFS/DVC style. There should be a level of indirection if reproducibility is needed (pointers to files are versioned, not the data directly; the data should stay in the cloud).

Here, I would love to mention once more some other cool features that DVC has, e.g. the `dvc exp` set of commands, which creates custom Git refs to snapshot experiments, or the DVCLive logger, which helps capture metrics, plots, etc. There is also a VS Code extension [1] that provides quite a cool experience for the experiments workflow inside VS Code.

The point here is that for DVC, the ability to capture large files and directories (that do not fit into Git) was always a low-level mechanism to support higher-level scenarios (e.g. you need to save a model somewhere as an output of an experiment).

[1] https://marketplace.visualstudio.com/items?itemName=Iterative.dvc
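The "level of indirection" described above can be sketched roughly as follows: the large file goes into a content-addressed store, and only a small pointer file is committed to Git. This is an illustration under stated assumptions, not DVC's actual file format or cache layout; `snapshot`, `restore`, the `.ptr.json` suffix, and the local `store` directory (standing in for cloud storage) are all hypothetical.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def snapshot(data_file: Path, store_dir: Path) -> Path:
    """Copy data_file into a content-addressed store and write a pointer file.

    The pointer file is small and is what would be committed to Git.
    """
    digest = hashlib.sha256(data_file.read_bytes()).hexdigest()
    store_dir.mkdir(parents=True, exist_ok=True)
    stored = store_dir / digest
    if not stored.exists():
        # Identical content is stored only once, under its hash.
        shutil.copyfile(data_file, stored)
    pointer = data_file.with_name(data_file.name + ".ptr.json")
    meta = {"sha256": digest, "size": data_file.stat().st_size}
    pointer.write_text(json.dumps(meta, indent=2))
    return pointer


def restore(pointer: Path, store_dir: Path, target: Path) -> None:
    """Materialize the data referenced by a pointer file at target."""
    meta = json.loads(pointer.read_text())
    shutil.copyfile(store_dir / meta["sha256"], target)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        model = root / "model.bin"
        model.write_bytes(b"weights-v1")
        ptr = snapshot(model, root / "store")
        restore(ptr, root / "store", root / "restored.bin")
        print(ptr.read_text())
```

Checking out an older commit then brings back the older pointer file, and `restore` fetches the matching bytes from the store, so the repository history stays small while the data remains reproducible.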
skadamat over 1 year ago
I work at XetHub and we're taking a different approach here to managing ML experiments in git.

Instead of trying to store data and ML models in one place (like S3) and code, models & documentation in another place (like GitHub), we are scaling Git so you can version everything in a single system. You can just use git and you don't need to learn a new tool or set of commands.

This way, you can start with a simple experiment tracking approach of folders inside the same branch and then evolve gradually to multiple branches with long-running experiments.

We're about to release our GitHub integration, so data & ML teams can take advantage of this inside their existing GitHub repos. If anyone wants a tour or wants to chat, my email's in my HN profile.

If you're curious about our tech:

- Here's an example 3.3 TB Git repo: https://xethub.com/XetHub/RedPajama-Data-1T

- We wrote a paper on our solution to scale git to 100 terabytes: https://about.xethub.com/blog/git-is-for-data-published-in-cidr-2023

- We created a Rust library to mount large repos to machines with limited storage space: https://news.ycombinator.com/item?id=37573679
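The "folders inside the same branch" starting point could look something like the sketch below, where each run writes its parameters and metrics into its own directory and everything is committed with plain git. The layout (`experiments/<run>/params.json` and `metrics.json`) and the helper names are assumptions for illustration, not something XetHub prescribes.

```python
import json
import tempfile
from pathlib import Path


def record_run(root: Path, name: str, params: dict, metrics: dict) -> Path:
    """Write one experiment's parameters and metrics into its own folder."""
    run_dir = root / name
    run_dir.mkdir(parents=True, exist_ok=True)
    (run_dir / "params.json").write_text(json.dumps(params, indent=2, sort_keys=True))
    (run_dir / "metrics.json").write_text(json.dumps(metrics, indent=2, sort_keys=True))
    return run_dir


def best_run(root: Path, metric: str) -> str:
    """Return the name of the run folder with the highest value for metric."""
    scores = {}
    for metrics_file in root.glob("*/metrics.json"):
        scores[metrics_file.parent.name] = json.loads(metrics_file.read_text())[metric]
    return max(scores, key=scores.get)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        experiments = Path(tmp) / "experiments"
        record_run(experiments, "lr-0.01", {"lr": 0.01}, {"accuracy": 0.91})
        record_run(experiments, "lr-0.1", {"lr": 0.1}, {"accuracy": 0.87})
        print(best_run(experiments, "accuracy"))
```

Because every run is just files in the working tree, `git diff` and `git log` on the experiments folder already give a basic history; moving to one branch per long-running experiment is a later step.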
BradyJ27 over 1 year ago
I have used DVC (specifically pipelines and experiments) for a little while now, and I have found it to be a great tool for creating a standardized process for training ML models. Workflows in so many teams consist of a bunch of notebooks that aren't versioned and that all live on developers' local machines, with no reproducibility or standardization. DVC is a great lightweight tool that is easy to set up and use, and customizable to whatever hardware or architecture you are using. Most teams that I have seen keep data and models on local machines and do not version them whatsoever. DVC has been great for creating reproducible models, which has always been the biggest focus point for me. Overall, I think it is a great tool that does a whole lot more than just data version control; things like experiments and DVCLive are super great.
vinni2 over 1 year ago
I gave up on dvc and switched to huggingface and wandb instead, because of the way it handled large files and the large local cache it downloaded.
farhanhubble over 1 year ago
I'm using DVC for managing experiments as well as data versioning for 100,000s of files. Its git-like interface is great, but it does have scaling issues, especially with hooks taking tens of minutes on every commit. It also does not support parallel stage execution yet.
fancy_pantser over 1 year ago
Use Determined if you want a nice UI: https://github.com/determined-ai/determined#readme