TechEcho

So you want to rent an NVIDIA H100 cluster? 2024 Consumer Guide

297 points by ea016 10 months ago

8 comments

latchkey 10 months ago
Great post. The Ethernet section is especially interesting to me.

I'm building a cluster of 16x Dell XE9680s (128 AMD MI300X GPUs) [0], with 8x 2p200G Broadcom cards (running at 400G), all connected to a single Dell PowerSwitch Z9864F-ON, which should prevent any slowness. It will be connected over RoCEv2 [1].

We're going with Ethernet because we believe in open standards, and few talk about the fact that the lead time on IB was last quoted to me at 50+ weeks. As kind of mentioned in the article, if you can't even deploy a cluster, the speed of the network means less and less.

I can't wait to do some benchmarking on the system to see whether we run into similar issues. Thankfully, we have a great Dell partnership, with full support, so I believe we are well covered in terms of any potential issues.

Our datacenter is 100% green and low PUE, and we are very proud of that as well. Hope to announce which one soon.

[0] https://hotaisle.xyz/compute/
[1] https://hotaisle.xyz/networking/
huqedato 10 months ago
The only essential aspect this article doesn't answer: How much does it cost? All the rest is metadata. I would have preferred a clear table with vendors, prices and features. And less bla-bla.
barbazoo 10 months ago
> Electricity sources and CO2 emissions

I love that they included this in their consideration and pointed out the impact running these GPUs has on the environment.
eigenvalue 10 months ago
Lots of good and detailed information here, thanks. I'm curious why Ethernet interconnect is so unreliable in practice compared to InfiniBand. I would think that at this point, after a decade or more of current Ethernet standards, all the kinks would be worked out, and the worst that would happen would be occasional latency spikes and a few lost packets that could be retransmitted quickly. Shouldn't the training frameworks be more robust to that sort of thing?
ec109685 10 months ago
How do the large clouds compare, in availability and cost, with finding a smaller provider and renting a dedicated cluster?
silverlake 10 months ago
Good info! I use an HPC with SLURM. 40k GPUs shared by hundreds of users. It works well enough. I don’t know how the market for cloud-based clusters works. Why didn’t OP use AWS or Google for on-demand training? Is it just down to cost?
8organicbits 10 months ago
Pretty sparse on pricing data; I guess everyone asked them to keep it private.
Jun8 10 months ago
Say you want to burn about $500 on 8 nodes for a day as a curiosity project. Any suggestions for what job to run?