
Show HN: TorchSubmit – Painless multi-node training with PyTorch (no SLURM/K8s)

3 points | by tony_francis | 10 months ago
After dealing with the pain of distributed PyTorch model development and training across multiple companies, I decided to build a quick open source tool that hopefully will be helpful to some of y'all.

The Problem

Say you have an on-demand GPU cluster from a provider like Lambda Labs. Single-node GPU training is fairly painless: just SSH in, provision the node, and run your code.

However, the moment you go multi-node, things get trickier. Each node has to be provisioned with the same dependencies and access to the same code, and then you need some way of submitting and managing the job across all nodes.

The simple, painstaking way is to SSH into each node, configure it, and then run your code with `torchrun` (a sketch of what that per-node command looks like is at the end of this post). But doing this over and over again while trying to rapidly iterate is a huge pain.

This is where SLURM or Ray is typically used to manage job submissions. But setting up SLURM is time-consuming and non-trivial (especially when dealing with InfiniBand). If you have the cash to reserve a cluster for a year, that's not a problem; if you're running on-demand with clusters of varying size, it quickly becomes painful.

Ray tries to solve this, but if you've ever used it extensively, it's riddled with bugs and highly sensitive to the underlying resources (shared memory) and host networking.

Enter TorchSubmit

TorchSubmit is a lightweight job submission and cluster management tool for submitting and managing distributed training jobs.

It handles two things:

1. Syncing your working directory to all the nodes in your cluster.
2. Executing the job across the cluster with `torchrun` properly configured for multi-node (or single-node) distributed training.

It allows for rapid development with minimal configuration, so you're spending your time on training instead of DevOps.

Under the hood, it uses Fabric to connect to and manage nodes via SSH, and `torchrun` to run fault-tolerant distributed training (a rough sketch of this mechanism is also at the end of this post).

How to use

1. First, provision your cluster with your Python dependencies. Personally, I use Ansible to do this. On your local machine, run `pip install torch-submit`.

2. Create a cluster with `torch-submit cluster create`. This will interactively walk you through cluster setup. All you need to do is provide each node's public and private IP addresses (optional but recommended for inter-node communication) and the path to your SSH key. Submit your first job with `torch-submit job submit -- python train.py`.

3. You'll now see the job running under `torch-submit job list`.

4. To see the logs from your job: `torch-submit job logs --tail <job-id>`.

GitHub project: https://github.com/dream3d-ai/torch-submit

Happy training and hope this helps!
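
For reference, the manual workflow above boils down to running something like the following on every node. The node count, addresses, port, GPU count, and script name here are placeholders; only `--node_rank` changes from node to node, which is exactly what makes repeating this by hand so tedious:

```bash
# Run on EVERY node, with the same code and dependencies already in place.
# --node_rank must be 0 on the first node, 1 on the second, and so on.
torchrun \
  --nnodes=4 \
  --nproc_per_node=8 \
  --node_rank=0 \
  --master_addr=10.0.0.1 \
  --master_port=29500 \
  train.py
```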
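And a rough sketch of the "under the hood" mechanism: Fabric opens an SSH connection to each node, the working directory is synced over, and `torchrun` is launched with the right multi-node settings. This is illustrative only, not TorchSubmit's actual code; the IPs, user name, key path, and rsync step are assumptions:

```python
# Illustrative sketch only: not TorchSubmit's actual implementation.
# Assumes Ubuntu nodes reachable over SSH, rsync available locally, and
# Python dependencies already provisioned on every node.
import os
from fabric import Connection

nodes = ["10.0.0.1", "10.0.0.2"]               # node IPs (placeholders)
ssh_key = os.path.expanduser("~/.ssh/id_rsa")  # SSH key path (placeholder)
master_addr, master_port = nodes[0], 29500

for rank, host in enumerate(nodes):
    conn = Connection(host, user="ubuntu", connect_kwargs={"key_filename": ssh_key})

    # 1) Sync the local working directory to the node.
    conn.local(f"rsync -az --exclude .git ./ ubuntu@{host}:~/job/")

    # 2) Launch torchrun in the background, configured for multi-node training.
    conn.run(
        "cd ~/job && nohup torchrun "
        f"--nnodes={len(nodes)} --nproc_per_node=8 --node_rank={rank} "
        f"--master_addr={master_addr} --master_port={master_port} "
        "train.py > train.log 2>&1 &"
    )
```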

No comments yet.