
Ask HN: How do you manage your ML experiments?

4 points by ridgeflex over 4 years ago
For work, I need to run many PyTorch/TF/MXNet experiments in parallel on a cloud instance with multiple GPUs. Currently, I use TensorBoard (and its variants) to log results and tmux to run experiments simultaneously on multiple GPUs.

However, I often run into these issues:

1. Some experiments fail due to run-time errors, and tmux allows them to fail silently.

2. Some experiments cause a GPU to run out of memory, and I have to dig through many tmux sessions to find and re-run that experiment.

3. If many GPUs are close to full, I have to revert to running experiments in sequence, and have to wait until experiment_i is over before running experiment_i+1.

4. When running different experiments, I have to manually estimate how much GPU memory a specific experiment will consume before I can deploy them onto multiple GPUs.

5. When doing a particularly tedious task (e.g. hyper-parameter search), there are often on the order of a hundred experiments; this becomes extremely difficult to maintain manually using tmux.

Ideally, a perfect solution for this workflow would be a tool that could 1) profile memory consumption for a set of experiments, 2) automatically deploy experiments onto a cluster of GPUs, 3) re-run, queue, or re-assign experiments to other GPUs if needed, and 4) send notifications/keep track of all experiment progress.

I currently know of other tools like PyTorch Lightning (which only works with PyTorch and requires a significant code restructure) and Weights & Biases (which only has experiment progress/logging ability), but I have yet to find something lightweight and flexible enough to handle all of these requirements.

What's the best way to manage experiments like this?
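
As a rough illustration of points 2) and 3), a minimal scheduler could poll nvidia-smi for free memory, launch queued commands on GPUs that have room, and re-queue anything that exits non-zero. The sketch below is hypothetical (the train.py commands and the memory estimates are placeholders, not an existing tool):

    import collections
    import os
    import subprocess
    import time

    def free_memory_mb():
        """Return {gpu_index: free MiB} by parsing nvidia-smi."""
        out = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=index,memory.free",
             "--format=csv,noheader,nounits"], text=True)
        return {int(idx): int(mem)
                for idx, mem in (line.split(",") for line in out.strip().splitlines())}

    # (command, estimated MiB) pairs -- the estimates are the manual guesses
    # from point 4; a real tool would profile these instead.
    queue = collections.deque([
        ("python train.py --lr 1e-3", 8000),
        ("python train.py --lr 1e-4", 8000),
    ])
    running = []  # (process, command, estimated MiB)

    while queue or running:
        # Points 1/3: notice failed runs instead of letting tmux hide them,
        # and put them back on the queue.
        for proc, cmd, mb in list(running):
            code = proc.poll()
            if code is None:
                continue  # still running
            running.remove((proc, cmd, mb))
            if code != 0:
                print("failed, re-queueing:", cmd)
                queue.append((cmd, mb))

        # Point 2: launch queued experiments on GPUs with enough free memory.
        for gpu, free in free_memory_mb().items():
            if queue and free >= queue[0][1]:
                cmd, mb = queue.popleft()
                env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
                proc = subprocess.Popen(cmd, shell=True, env=env)
                running.append((proc, cmd, mb))

        time.sleep(30)

Notifications (point 4) could be bolted on where the failure is printed.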

1 comment

p1esk over 4 years ago
I do something very similar (five 8xGPU servers, PyCharm, ssh, tmux), and I have no solution to the issues you described. I manually launch one ssh/tmux session per server and typically have multiple tmux panes with nvidia-smi and htop outputs. I keep reconnecting to these ssh/tmux sessions to monitor progress. I also save the results of experiments to text files, so that at the end of the hyperparameter search I can just look at those files. Looking at files is sometimes easier/quicker than looking through tmux sessions (the files are kept in shared storage).

I've seen plenty of experiment management tools being advertised, but every time I looked at them they were either very limited or required significant restructuring of my code or my workflow.

I'd like to hear about whatever solution you find, because I agree, this does get tedious and painful sometimes.
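
For what it's worth, the "results to text files on shared storage" habit can be as small as a helper that appends one line per finished run; the path and field names below are made up, not part of any particular tool:

    import json
    import os
    import socket
    import time

    RESULTS_FILE = "/shared/experiments/results.txt"  # hypothetical shared mount

    def log_result(run_name, hparams, metrics):
        """Append one finished run as a single JSON line so a whole
        hyperparameter search can be skimmed later with cat/grep."""
        os.makedirs(os.path.dirname(RESULTS_FILE), exist_ok=True)
        record = {
            "run": run_name,
            "host": socket.gethostname(),
            "time": time.strftime("%Y-%m-%d %H:%M:%S"),
            "hparams": hparams,
            "metrics": metrics,
        }
        with open(RESULTS_FILE, "a") as f:
            f.write(json.dumps(record) + "\n")

    # e.g. at the very end of a training script:
    # log_result("resnet_lr1e-3", {"lr": 1e-3}, {"val_acc": 0.87})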