
A RoCE network for distributed AI training at scale

81 points, by mikece, 9 months ago

4 comments

jauntywundrkind, 9 months ago
From the paper, it seems like they are using RDMA to/from the video cards, skipping the CPU.

> *These transactions require GPU-to-RDMA NIC support for optimal performance*

Remarkably, consumer computing has found a similar reason to bypass sending data through the CPU: texture streaming. DirectStorage and Sony's Kraken purport to let the GPU read directly from the SSD. It's a storage application rather than a NIC, but it's still built around PCIe P2P DMA (at least DirectStorage is, I think).

Table 2, network stats for 128 GPUs, is kind of interesting. Most collectives such as AllGather and AllReduce run with only 4 Queue Pairs. Not my area of expertise at all, but wow, that seems tiny! All this network, and basically everyone is talking to only a few peers? That's what it means, right?

The discussion at the end of the paper covers Flowlets. The description reminds me a little of hash bucket chaining: you try the first path, and if a conflict later arises or the path degrades, there's a fallback path already planned, like a chained fallback bucket in a hash table.
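To make the "only a few peers" intuition concrete, here is a minimal pure-Python simulation of a ring all-reduce, the textbook schedule many collective libraries use. It is not taken from the paper, and the function and variable names are made up for illustration; the point is simply that each rank only ever exchanges data with its two ring neighbours, so a handful of connections (and hence queue pairs) can suffice no matter how many GPUs participate.

```python
# Minimal ring all-reduce simulation (pure Python, illustrative only).
# Each of the n "ranks" exchanges data with exactly two neighbours
# (previous and next in the ring), regardless of how large n is.

def ring_allreduce(vectors):
    """vectors: one list of floats per rank; returns the per-rank result,
    which after the collective is the element-wise sum on every rank."""
    n = len(vectors)                              # number of ranks
    length = len(vectors[0])
    # Split each rank's vector into n contiguous chunks for pipelining.
    bounds = [(i * length) // n for i in range(n + 1)]
    chunks = [[v[bounds[i]:bounds[i + 1]] for i in range(n)] for v in vectors]

    # Phase 1: reduce-scatter. At step s, rank r-1 forwards chunk (r-1-s)
    # to rank r, which accumulates it into its own copy of that chunk.
    for s in range(n - 1):
        incoming = [chunks[(r - 1) % n][(r - 1 - s) % n] for r in range(n)]
        for r in range(n):
            c = (r - 1 - s) % n
            chunks[r][c] = [a + b for a, b in zip(chunks[r][c], incoming[r])]

    # Phase 2: all-gather. Fully reduced chunks circulate around the ring.
    for s in range(n - 1):
        incoming = [chunks[(r - 1) % n][(r - s) % n] for r in range(n)]
        for r in range(n):
            chunks[r][(r - s) % n] = incoming[r]

    # Every rank now holds the full reduced vector.
    return [sum(chunks[r], []) for r in range(n)]


if __name__ == "__main__":
    ranks = [[1.0, 2.0, 3.0, 4.0],
             [10.0, 20.0, 30.0, 40.0],
             [100.0, 200.0, 300.0, 400.0],
             [1000.0, 2000.0, 3000.0, 4000.0]]
    print(ring_allreduce(ranks)[0])   # [1111.0, 2222.0, 3333.0, 4444.0]
```

In this schedule the set of communication peers per rank is fixed at two, independent of cluster size, which is at least consistent with the small queue-pair counts the comment points out.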
eslaught, 9 months ago
So they're re-inventing HPC networks in the data center.

https://en.wikipedia.org/wiki/Fat_tree

https://www.cs.umd.edu/class/spring2021/cmsc714/readings/Kim-Dragonfly.pdf

I'm sure there are innovations here, but most of this has been standard in HPC for decades. (Fat trees since 1985, Dragonfly since 2008.) This is not new science, folks.
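For a rough sense of what the classic three-tier fat tree provides, here is a back-of-the-envelope sizing sketch using the standard k-ary folded-Clos formulas; these are textbook properties of the topology, not figures from the paper or from any particular deployment.

```python
# Back-of-the-envelope sizing for a classic three-tier k-ary fat tree
# built from identical k-port switches (standard textbook formulas).

def fat_tree(k):
    """Return (hosts, edge, aggregation, core) switch/host counts for port count k."""
    assert k % 2 == 0, "k-ary fat trees use an even port count"
    hosts = k ** 3 // 4                 # k pods * (k/2) edge switches * (k/2) hosts each
    edge = aggregation = k * (k // 2)   # k/2 edge and k/2 aggregation switches per pod
    core = (k // 2) ** 2
    return hosts, edge, aggregation, core


for k in (16, 32, 64):
    hosts, edge, agg, core = fat_tree(k)
    print(f"k={k:>2}: {hosts:>6} hosts, "
          f"{edge + agg + core} switches ({edge} edge / {agg} agg / {core} core)")
```

With 64-port switches this already reaches 65,536 hosts at full bisection bandwidth, which is part of why the topology has been an HPC staple for so long.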
teleforce, 9 months ago
Interesting approach to distributed AI training, albeit a very expensive one. Personally, I'm baffled that no one has come up with a project similar to SETI@home or the Great Internet Mersenne Prime Search to harness truly distributed, low-cost resources for open-model AI training at scale [1], [2].

[1] SETI@home: https://setiathome.berkeley.edu/

[2] Great Internet Mersenne Prime Search: https://en.wikipedia.org/wiki/Great_Internet_Mersenne_Prime_Search
zuckerma, 9 months ago
This is slick