TechEcho

8 comments

latchkey 10 months ago

Great post. The Ethernet section is especially interesting to me.

I'm building a cluster of 16x Dell XE9680s (128 AMD MI300X GPUs) [0], with 8x 2p200G Broadcom cards (running at 400G), all connected to a single Dell PowerSwitch Z9864F-ON, which should prevent any slowness. It will be connected over RoCEv2 [1].

We're going with Ethernet because we believe in open standards, and few talk about the fact that the lead time on IB was last quoted to me at 50+ weeks. As kind of mentioned in the article, if you can't even deploy a cluster, the speed of the network means less and less.

I can't wait to do some benchmarking on the system to see if we run into similar issues or not. Thankfully, we have a great Dell partnership, with full support, so I believe that we are well covered in terms of any potential issues.

Our datacenter is 100% green and low PUE, and we are very proud of that as well. Hope to announce which one soon.

[0] https://hotaisle.xyz/compute/
[1] https://hotaisle.xyz/networking/

huqedato 10 months ago

The only essential aspect this article doesn't answer: How much does it cost? All the rest is metadata. I would have preferred a clear table with vendors, prices and features. And less bla-bla.

barbazoo 10 months ago

> Electricity sources and CO2 emissions

I love that they included this in their consideration and pointed out the impact running these GPUs has on the environment.

eigenvalue 10 months ago

Lots of good and detailed information here, thanks. I'm curious why Ethernet interconnect is so unreliable in practice compared to InfiniBand. I would think that at this point, after a decade or more of current Ethernet standards, all the kinks would be worked out and the worst that would happen would be occasional latency spikes and a few lost packets that could be retransmitted quickly. Shouldn't the training frameworks be more robust to that sort of thing?

ec109685 10 months ago

How do the large clouds compare, in availability and cost, with finding a smaller provider and renting a dedicated cluster?

silverlake 10 months ago

Good info! I use an HPC with SLURM. 40k GPUs shared by hundreds of users. It works well enough. I don’t know how the market for cloud-based clusters works. Why didn’t OP use AWS or Google for on-demand training? Is it just down to cost?

8organicbits 10 months ago

Pretty sparse on pricing data; I guess everyone asked them to keep it private.

Jun8 10 months ago

Say you want to burn about $500 as a curiosity project for 8 nodes for a day. Any suggestions for what job to run?

So you want to rent an NVIDIA H100 cluster? 2024 Consumer Guide