科技回声

4 comments

From the paper, it seems they are using RDMA directly to and from the video cards, skipping the CPU.

> *These transactions require GPU-to-RDMA NIC support for optimal performance*

Remarkably, consumer computing has similarly found a reason to avoid sending data through the CPU: texture streaming. DirectStorage and Sony's Kraken purport to let the GPU read directly from the SSD. It's a storage application rather than a NIC one, but it's still built around PCIe peer-to-peer DMA (at least DirectStorage is, I think).

Table 2, the network stats for 128 GPUs, is kind of interesting. Most collectives, such as AllGather and AllReduce, run with only 4 queue pairs. Not my area of expertise at all, but wow, that seems tiny! All this network, and basically everyone's talking to only a few peers? That's what it means, right?

The discussion at the end of the paper talks about flowlets. The description makes me think a little of hash bucket chaining: you try the first path, and if a conflict later arises or the path degrades, there's a fallback path already planned, just as a hash table would have a chained fallback bucket.
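
To make the chaining analogy concrete, here is a minimal Python sketch. It illustrates only the analogy, not the paper's flowlet mechanism: `plan_chain`, `choose_path`, and the spine names are hypothetical, and picking the primary path with CRC32 is an assumption made for the example.

```python
import zlib


def plan_chain(flow_id: str, paths: list[str]) -> list[str]:
    """Pre-plan an ordered chain of paths for a flow: primary first, then fallbacks.

    Hypothetical helper: the primary is chosen by hashing the flow id
    (CRC32 is an assumption), and the fallbacks follow in a fixed order.
    """
    start = zlib.crc32(flow_id.encode()) % len(paths)
    return paths[start:] + paths[:start]


def choose_path(chain: list[str], healthy: set[str]) -> str:
    """Walk the pre-planned chain and return the first path still healthy."""
    for path in chain:
        if path in healthy:
            return path
    raise RuntimeError("no healthy path available")


# Hypothetical fabric with four uplinks.
paths = ["spine-0", "spine-1", "spine-2", "spine-3"]
chain = plan_chain("gpu7->gpu42", paths)

# With every path healthy, the flow uses its primary path.
primary = choose_path(chain, set(paths))

# If the primary degrades, the next entry in the chain takes over.
fallback = choose_path(chain, set(paths) - {primary})
```

The fallback order is fixed when the chain is planned, so no new decision is needed at the moment a path degrades, which is the property the analogy points at.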

eslaught, 9 months ago

So they're re-inventing HPC networks in the data center.

https://en.wikipedia.org/wiki/Fat_tree

https://www.cs.umd.edu/class/spring2021/cmsc714/readings/Kim-Dragonfly.pdf

I'm sure there are innovations here, but most of this has been standard in HPC for decades (fat trees since 1985, Dragonfly since 2008). This is not new science, folks.

teleforce, 9 months ago

Interesting approach to distributed AI training, albeit a very expensive one. Personally, I'm baffled that no one has come up with a project similar to SETI@home or the Great Internet Mersenne Prime Search to harness truly distributed, low-cost solutions for an open model of AI training at scale [1], [2].

[1] SETI@home: https://setiathome.berkeley.edu/

[2] Great Internet Mersenne Prime Search: https://en.wikipedia.org/wiki/Great_Internet_Mersenne_Prime_Search

A RoCE network for distributed AI training at scale
