
Show HN: TorchSubmit – Painless multi-node training with PyTorch (no SLURM/K8s)

3 points | by tony_francis | 10 months ago
After dealing with the pain of distributed PyTorch model development and training across multiple companies, I decided to build a quick open source tool that hopefully will be helpful to some of y'all.

The Problem

Say you have an on-demand GPU cluster from a provider like Lambda Labs. Single-node GPU training is fairly painless: just SSH in, provision the node, and run your code.

However, the moment you go multi-node, things get trickier. Each node has to be provisioned with the same dependencies and access to the same code, and then you need some way of submitting and managing the job across all nodes.

The simple, painstaking way is to SSH into each node, configure it, and then run your code with `torchrun` (a sketch of what that per-node command looks like is at the end of this post). But doing this over and over again while trying to rapidly iterate is a huge pain.

This is where SLURM or Ray is typically used to manage job submissions. But setting up SLURM is time-consuming and non-trivial (especially when dealing with InfiniBand). If you have the cash to reserve a cluster for a year, that's not a problem; if you're running on-demand with clusters of varying size, it quickly becomes painful.

Ray tries to solve this, but if you've ever used it extensively, it's riddled with bugs and highly sensitive to the underlying resources (shared memory) and host networking.

Enter TorchSubmit

TorchSubmit is a lightweight job submission and cluster management tool for submitting and managing distributed training jobs.

It handles two things:

1. Syncing your working directory to all the nodes in your cluster.
2. Executing the job across the cluster with `torchrun` properly configured for multi-node (or single-node) distributed training.

It allows for rapid development with minimal configuration, so you're spending your time on training instead of DevOps.

Under the hood, it uses Fabric to connect to and manage nodes via SSH, and `torchrun` to run fault-tolerant distributed training (a rough sketch of this mechanism is also at the end of this post).

How to use

1. First, provision your cluster with your Python dependencies. Personally, I use Ansible to do this. On your local machine, run `pip install torch-submit`.

2. Create a cluster with `torch-submit cluster create`. This will interactively walk you through cluster setup. All you need to do is provide each node's public and private IP addresses (optional but recommended for inter-node communication) and the path to your SSH key. Submit your first job with `torch-submit job submit -- python train.py`.

3. You'll now see the job running under `torch-submit job list`.

4. To see the logs from your job: `torch-submit job logs --tail <job-id>`.

GitHub project: https://github.com/dream3d-ai/torch-submit

Happy training and hope this helps!
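
For reference, the manual workflow above boils down to running something like the following on every node. The node count, addresses, port, GPU count, and script name here are placeholders; only `--node_rank` changes from node to node, which is exactly what makes repeating this by hand so tedious:

```bash
# Run on EVERY node, with the same code and dependencies already in place.
# --node_rank must be 0 on the first node, 1 on the second, and so on.
torchrun \
  --nnodes=4 \
  --nproc_per_node=8 \
  --node_rank=0 \
  --master_addr=10.0.0.1 \
  --master_port=29500 \
  train.py
```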
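And a rough sketch of the "under the hood" mechanism: Fabric opens an SSH connection to each node, the working directory is synced over, and `torchrun` is launched with the right multi-node settings. This is illustrative only, not TorchSubmit's actual code; the IPs, user name, key path, and rsync step are assumptions:

```python
# Illustrative sketch only: not TorchSubmit's actual implementation.
# Assumes Ubuntu nodes reachable over SSH, rsync available locally, and
# Python dependencies already provisioned on every node.
import os
from fabric import Connection

nodes = ["10.0.0.1", "10.0.0.2"]               # node IPs (placeholders)
ssh_key = os.path.expanduser("~/.ssh/id_rsa")  # SSH key path (placeholder)
master_addr, master_port = nodes[0], 29500

for rank, host in enumerate(nodes):
    conn = Connection(host, user="ubuntu", connect_kwargs={"key_filename": ssh_key})

    # 1) Sync the local working directory to the node.
    conn.local(f"rsync -az --exclude .git ./ ubuntu@{host}:~/job/")

    # 2) Launch torchrun in the background, configured for multi-node training.
    conn.run(
        "cd ~/job && nohup torchrun "
        f"--nnodes={len(nodes)} --nproc_per_node=8 --node_rank={rank} "
        f"--master_addr={master_addr} --master_port={master_port} "
        "train.py > train.log 2>&1 &"
    )
```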

No comments yet.