Most of the information about this is in this PDF (I hate when people publish interesting information exclusively in PDFs): https://raw.githubusercontent.com/NousResearch/DisTrO/main/A_Preliminary_Report_on_DisTrO.pdf

I converted it to Markdown (using Gemini 1.5 Pro) and pasted it into a Gist here: https://gist.github.com/simonw/46a33d66e069efe5c10b63625fdabb4e#a-preliminary-report-on-distro

From the abstract:

> Training large scale neural networks typically involves sharing gradients between all accelerators, which necessitates specialized, high-speed interconnects. To address this, we introduce DisTrO, a family of architecture-agnostic and network-agnostic distributed optimizers that reduces the inter-GPU communication requirements by four to five orders of magnitude without relying on amortized analysis, enabling low-latency training of large neural networks on slow internet bandwidths with heterogeneous networking hardware.

This could be a HUGE deal.

Currently, if you want to train giant LLMs you need a big pile of GPUs in the same physical location, because of the sheer volume of gradient data that has to shuffle between them during training.

If DisTrO works as intended, it will be possible to train models using GPUs in different places - potentially enabling SETI@home-style training, where thousands of people with gaming PCs at home could donate their GPU time to a large training effort.

Their tweet about this has more: https://twitter.com/NousResearch/status/1828121648383566270

> Nous Research is proud to release a preliminary report on DisTrO (Distributed Training Over-the-Internet) a family of architecture-agnostic and network-agnostic distributed optimizers that reduces the inter-GPU communication requirements by 1000x to 10,000x without relying on amortized analysis, and matches AdamW+All-Reduce in convergence rates. This enables low-latency training of large neural networks on slow internet bandwidths with heterogeneous networking hardware.

> DisTrO can increase the resilience and robustness of training LLMs by minimizing dependency on a single entity for computation. DisTrO is one step towards a more secure and equitable environment for all participants involved in building LLMs.

> Without relying on a single company to manage and control the training process, researchers and institutions can have more freedom to collaborate and experiment with new techniques, algorithms, and models. This increased competition fosters innovation, drives progress, and ultimately benefits society as a whole.
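
To make the bandwidth claim concrete, here's a rough back-of-envelope sketch in Python. The parameter count, fp16 gradient format, and 100 Mbit/s uplink are my own illustrative assumptions, not numbers from the report; only the 1,000x to 10,000x reduction factor comes from their announcement.

```python
# Back-of-envelope: bytes moved per optimizer step, conventional vs. DisTrO's claim.
# Assumptions (mine, not from the report): ~1.2B parameters, fp16 gradients,
# a 100 Mbit/s home uplink, and the announced 1,000x-10,000x reduction.

PARAMS = 1.2e9                 # parameter count (illustrative)
BYTES_PER_GRAD = 2             # one fp16 gradient element

grad_bytes = PARAMS * BYTES_PER_GRAD   # all-reduce payload: ~2.4 GB per step
distro_low = grad_bytes / 1_000        # 1,000x reduction: ~2.4 MB per step
distro_high = grad_bytes / 10_000      # 10,000x reduction: ~240 KB per step

uplink_bytes_per_s = 100e6 / 8         # 100 Mbit/s home connection

for label, payload in [
    ("all-reduce (baseline)", grad_bytes),
    ("DisTrO @ 1,000x", distro_low),
    ("DisTrO @ 10,000x", distro_high),
]:
    seconds = payload / uplink_bytes_per_s
    print(f"{label:22s} {payload / 1e6:10.3f} MB/step  ~{seconds:.3f} s to send at 100 Mbit/s")
```

Under those assumptions the baseline is minutes of transfer per step over a residential connection, while the reduced payload is a fraction of a second - which is what makes the SETI@home comparison plausible at all.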