Most of the information about this is in this PDF (I hate when people publish interesting information exclusively in PDFs): https://raw.githubusercontent.com/NousResearch/DisTrO/main/A_Preliminary_Report_on_DisTrO.pdf

I converted it to Markdown (using Gemini 1.5 Pro) and pasted it into a Gist here: https://gist.github.com/simonw/46a33d66e069efe5c10b63625fdabb4e#a-preliminary-report-on-distro

From the abstract:

> Training large scale neural networks typically involves sharing gradients between all accelerators, which necessitates specialized, high-speed interconnects. To address this, we introduce DisTrO, a family of architecture-agnostic and network-agnostic distributed optimizers that reduces the inter-GPU communication requirements by four to five orders of magnitude without relying on amortized analysis, enabling low-latency training of large neural networks on slow internet bandwidths with heterogeneous networking hardware.

This could be a HUGE deal.

Currently, if you want to train giant LLMs you need a big pile of GPUs in the same physical location, because of the sheer volume of gradient data that has to shuffle between them during training.

If DisTrO works as intended, it will be possible to train models using GPUs in different places - potentially enabling SETI@home-style training, where thousands of people with gaming PCs at home could donate their GPU time to a large training effort.

Their tweet about this has more: https://twitter.com/NousResearch/status/1828121648383566270

> Nous Research is proud to release a preliminary report on DisTrO (Distributed Training Over-the-Internet) a family of architecture-agnostic and network-agnostic distributed optimizers that reduces the inter-GPU communication requirements by 1000x to 10,000x without relying on amortized analysis, and matches AdamW+All-Reduce in convergence rates. This enables low-latency training of large neural networks on slow internet bandwidths with heterogeneous networking hardware.

> DisTrO can increase the resilience and robustness of training LLMs by minimizing dependency on a single entity for computation. DisTrO is one step towards a more secure and equitable environment for all participants involved in building LLMs.

> Without relying on a single company to manage and control the training process, researchers and institutions can have more freedom to collaborate and experiment with new techniques, algorithms, and models. This increased competition fosters innovation, drives progress, and ultimately benefits society as a whole.
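
To make the bandwidth claim concrete, here's a rough back-of-envelope sketch in Python. The parameter count, fp16 gradient format, and 100 Mbit/s uplink are my own illustrative assumptions, not numbers from the report; only the 1,000x to 10,000x reduction factor comes from their announcement.

```python
# Back-of-envelope: bytes moved per optimizer step, conventional vs. DisTrO's claim.
# Assumptions (mine, not from the report): ~1.2B parameters, fp16 gradients,
# a 100 Mbit/s home uplink, and the announced 1,000x-10,000x reduction.

PARAMS = 1.2e9                 # parameter count (illustrative)
BYTES_PER_GRAD = 2             # one fp16 gradient element

grad_bytes = PARAMS * BYTES_PER_GRAD   # all-reduce payload: ~2.4 GB per step
distro_low = grad_bytes / 1_000        # 1,000x reduction: ~2.4 MB per step
distro_high = grad_bytes / 10_000      # 10,000x reduction: ~240 KB per step

uplink_bytes_per_s = 100e6 / 8         # 100 Mbit/s home connection

for label, payload in [
    ("all-reduce (baseline)", grad_bytes),
    ("DisTrO @ 1,000x", distro_low),
    ("DisTrO @ 10,000x", distro_high),
]:
    seconds = payload / uplink_bytes_per_s
    print(f"{label:22s} {payload / 1e6:10.3f} MB/step  ~{seconds:.3f} s to send at 100 Mbit/s")
```

Under those assumptions the baseline is minutes of transfer per step over a residential connection, while the reduced payload is a fraction of a second - which is what makes the SETI@home comparison plausible at all.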