From the paper, it seems like they are using RDMA directly to/from the video cards, skipping the CPU.<p>> * These transactions
require GPU-to-RDMA NIC support for optimal performance*<p>Remarkably, consumer computing has found a similar reason to bypass sending data through the CPU: texture streaming. DirectStorage and Sony's Kraken purport to let the GPU read directly from the SSD. It's a storage application instead of a NIC one, but still built around PCIe peer-to-peer DMA (at least DirectStorage is, I think).<p>Table 2, network stats for 128 GPUs, is kind of interesting. Most collectives, such as AllGather and AllReduce, run with only 4 Queue Pairs. Not my area of expertise at all, but wow, that seems tiny! All this network, and basically everyone's talking to only a few peers? That's what it means, right?<p>The discussion at the end of the paper talked about flowlets. The description makes me think a little of hash bucket chaining, where you try the first path, and if a conflict later arises or the path degrades, there's a fallback path already planned, like there would be a fallback chained bucket in a hash table.
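<p>To make the hash-chaining analogy concrete, here's a rough sketch of what I have in mind. This is only my mental model, not the paper's actual mechanism: the names pick_paths and route, the path labels, and the "next slot over" fallback rule are all made up for illustration.

```python
import zlib


def pick_paths(flow_id: str, paths: list[str]) -> tuple[str, str]:
    """Return (primary, fallback) for a flow, both derived from one hash.

    Hypothetical helper: assumes at least two candidate paths.
    """
    h = zlib.crc32(flow_id.encode())
    primary = h % len(paths)
    # The fallback is decided up front (here simply the next slot over),
    # in the spirit of a bucket that already has a second entry lined up.
    fallback = (primary + 1) % len(paths)
    return paths[primary], paths[fallback]


def route(flow_id: str, paths: list[str], degraded: set[str]) -> str:
    """Use the primary path unless it is marked degraded."""
    primary, fallback = pick_paths(flow_id, paths)
    return fallback if primary in degraded else primary


paths = ["path0", "path1", "path2", "path3"]
primary, fallback = pick_paths("gpu3->gpu97", paths)
print(route("gpu3->gpu97", paths, degraded=set()))      # takes the primary
print(route("gpu3->gpu97", paths, degraded={primary}))  # takes the fallback
```

<p>The point is just that the alternative is chosen when the flow is first hashed, so switching later needs no new lookup. How a real switch detects degradation or decides when it's safe to move a flow is not modeled here.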