Great post. The Ethernet section is especially interesting to me.<p>I'm building a cluster of 16x Dell XE9680s (128 AMD MI300X GPUs) [0], each with 8x 2p200G Broadcom cards (running at 400G), all connected to a single Dell PowerSwitch Z9864F-ON, which should prevent any slowdowns. Everything will be connected over RoCEv2 [1].<p>We're going with Ethernet because we believe in open standards. Few people mention that the lead time on IB was last quoted to me at 50+ weeks. As kind of mentioned in the article, if you can't even deploy a cluster, the speed of the network matters less and less.<p>I can't wait to do some benchmarking on the system to see whether we run into similar issues. Thankfully, we have a great Dell partnership with full support, so I believe we are well covered for any potential issues.<p>Our datacenter is 100% green with a low PUE, and we are very proud of that as well. Hope to announce which one soon.<p><pre><code> [0] https://hotaisle.xyz/compute/
[1] https://hotaisle.xyz/networking/</code></pre>
The one essential question this article doesn't answer: how much does it cost? All the rest is metadata. I would have preferred a clear table of vendors, prices, and features, and less blah-blah.
> Electricity sources and CO2 emissions<p>I love that they included this in their consideration and pointed out the impact running these GPUs has on the environment.
Lots of good and detailed information here, thanks. I'm curious why Ethernet interconnects are so unreliable in practice compared to InfiniBand. I would think that at this point, after a decade or more of current Ethernet standards, all the kinks would be worked out, and the worst that would happen would be occasional latency spikes and a few lost packets that could be retransmitted quickly. Shouldn't the training frameworks be more robust to that sort of thing?
Good info! I use an HPC cluster with Slurm: 40k GPUs shared by hundreds of users. It works well enough. I don’t know how the market for cloud-based clusters works. Why didn’t OP use AWS or Google for on-demand training? Is it just down to cost?