> This AI cluster, worth more than $300 million, will offer a peak performance of 340 FP64 PFLOPS for technical computing and 39.58 INT8 ExaFLOPS for AI applications, according to Tom’s Hardware.

I was curious why this statement led with fp64 flops (instead of fp32, perhaps), but I looked up the H100 specs, and NV’s marketing page does the same thing. They’re obviously talking about the H100 SXM here, which has the same peak theoretical fp64 throughput (via the tensor cores) as fp32. The cluster perf is estimated by multiplying the per-GPU perf by 10k.

Also, obviously, int8 tensor ops aren’t ‘FLOPS’. I think Nvidia calls them “TOPS” (tera operations per second). There is a separate metric for ‘tensor flops’ using TF32.
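
A quick back-of-the-envelope check of that 10k multiplication, assuming the per-GPU numbers I remember from Nvidia’s H100 SXM datasheet (34 FP64 TFLOPS for the vanilla units, 3,958 INT8 TOPS with sparsity) and taking the 10,000-GPU count from the article:

```python
# Sanity check of the quoted cluster figures; per-GPU numbers are my
# recollection of the H100 SXM datasheet, not authoritative.
NUM_GPUS = 10_000
FP64_TFLOPS_PER_GPU = 34       # vanilla FP64 (the FP64 Tensor Core figure is 67, matching FP32)
INT8_TOPS_PER_GPU = 3_958      # INT8 Tensor Core, with sparsity

fp64_pflops = FP64_TFLOPS_PER_GPU * NUM_GPUS / 1_000        # TFLOPS -> PFLOPS
int8_exaops = INT8_TOPS_PER_GPU * NUM_GPUS / 1_000_000      # TOPS -> Exa-ops

print(f"{fp64_pflops:.0f} FP64 PFLOPS")   # -> 340
print(f"{int8_exaops:.2f} INT8 Exa-ops")  # -> 39.58
```

Both quoted figures fall out exactly, so the article’s numbers look like straight spec-sheet-times-10k arithmetic (with the INT8 one using the sparsity-assisted peak).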