Been working on a guide for ML folks on upgrading single-GPU training code to multi-GPU and multi-node. Code diffs and explanations are included.

The guide builds up to the final chapter (linked) on how to train a very large model like Llama 3.1 405B on a big cluster with plain PyTorch.

Everything is written using the direct PyTorch APIs (other than the model code, which just uses `transformers` models).

If there are topics of interest, feel free to open an issue in the repo; contributions are welcome.

I'm investigating adding a chapter on tensor parallelism, but its support in PyTorch is still in the early stages.
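
To give a flavor of the single-GPU → multi-GPU step, here's a minimal sketch using plain `torch.distributed` + DDP with a `transformers` model (this is just an illustration, not code lifted from the guide; the `gpt2` checkpoint and the random batch are placeholders, and it assumes a `torchrun --nproc_per_node=<gpus> train.py` launch):

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from transformers import AutoModelForCausalLM

def main():
    # torchrun sets RANK / LOCAL_RANK / WORLD_SIZE in the environment
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # placeholder model; the guide's chapters use much larger checkpoints
    model = AutoModelForCausalLM.from_pretrained("gpt2").cuda()
    # DDP keeps one replica per GPU and all-reduces gradients in backward()
    model = DDP(model, device_ids=[local_rank])

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    # stand-in batch; a real script would use a DataLoader + DistributedSampler
    input_ids = torch.randint(
        0, model.module.config.vocab_size, (2, 128)
    ).cuda()

    outputs = model(input_ids=input_ids, labels=input_ids)
    outputs.loss.backward()  # gradients averaged across ranks here
    optimizer.step()
    optimizer.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

The guide itself goes well beyond this (FSDP, multi-node launch, etc.), but the shape of the change is the same: initialize a process group, pin each rank to a device, wrap the model, and shard the data.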