
Google Cloud TPU Multislice Training

109 points by infixed, over 1 year ago

7 comments

xnx, over 1 year ago
Full title: "Google Cloud demonstrates the world's largest distributed training job for large language models across 50000+ TPU v5e chips"

Summary from Bard: "This article is about training large language models (LLMs) on Google Cloud TPUs. It discusses the challenges of training LLMs at scale, and how Google Cloud TPU Multislice Training addresses these challenges. The article also details the results of a recent experiment in which Google trained a 128B parameter LLM on 50,944 TPU v5e chips. This experiment is the largest publicly disclosed LLM distributed training job to date."
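[Editor's note] For readers wondering what the software side of "multislice" looks like, here is a minimal, hypothetical JAX sketch, not the configuration from the article and with made-up mesh sizes: chips within a slice are connected by fast ICI links, and multiple slices are bridged over the data-center network (DCN) into one logical device mesh.

```python
# Illustrative only: requires a real multislice TPU environment with a
# matching topology (here, 4 slices of 256 chips each) to actually run.
import jax
import numpy as np
from jax.experimental import mesh_utils
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Per-slice (ICI) mesh of 16x16 = 256 chips, replicated across 4 slices (DCN).
devices = mesh_utils.create_hybrid_device_mesh(
    mesh_shape=(16, 16),    # within-slice mesh over ICI
    dcn_mesh_shape=(4, 1),  # across-slice mesh over DCN
)
mesh = Mesh(devices, axis_names=("data", "model"))

# Shard the batch along the axis that crosses slices and the weights along the
# fast within-slice axis, so cross-slice traffic is mostly gradient reductions.
batch_sharding = NamedSharding(mesh, P("data", None))
weight_sharding = NamedSharding(mesh, P(None, "model"))

x = jax.device_put(np.ones((64, 8192), np.float32), batch_sharding)
w = jax.device_put(np.ones((8192, 8192), np.float32), weight_sharding)

@jax.jit
def forward(x, w):
    # With sharded inputs, the compiler inserts the needed collectives.
    return x @ w

y = forward(x, w)
```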
jrk, over 1 year ago
As far as I can tell, the article notably never defines what "slices" are or what "multi-slice" means.
leumassuehtam, over 1 year ago
Did they use this to train Gemini? Which raises the question, where is Gemini?
DavidSJ, over 1 year ago
Question for rwitten or anyone else involved in this project:

I see a per-device batch size of 6 for the 16B model. With 256x199 = 50944 TPUs and a sequence length of 2048, this works out to 104M tokens per batch. This is much larger than typical for training runs of dense LMs of this size, which are usually closer to ~4M tokens per batch.

Was your critical batch size really this large? In other words, did you really see a benefit as compared to a much smaller batch size (and probably many fewer TPUs)? Did you use some special learning rate schedule or optimizer to achieve this?
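[Editor's note] A quick sanity check on the arithmetic in that comment. The assumption that the per-device batch of 6 rides on top of a roughly 6-way sharding of the 16B model (so the global batch is about one sequence per chip) is ours, not stated in the thread, but it is what makes the quoted 104M figure come out.

```python
# Rough reconstruction of the ~104M-tokens-per-batch figure quoted above.
# Assumption (editorial, not from the thread): the 16B model is sharded
# ~6 ways, so a per-device batch of 6 gives ~one sequence per chip globally.
chips = 256 * 199            # 50,944 TPU v5e chips
seq_len = 2048
global_sequences = chips     # ~one sequence per chip under the assumption above
tokens_per_batch = global_sequences * seq_len
print(f"{tokens_per_batch / 1e6:.0f}M tokens per batch")          # ~104M
print(f"vs. a typical ~4M-token batch: {tokens_per_batch / 4e6:.0f}x larger")
```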
sashank_1509, over 1 year ago
Ok, so they claim in the article that 50,000 TPUs are equivalent to 10 exaflops of floating-point compute. That is equivalent to ~2,512 NVIDIA H100s, which is really quite small. Just shows the difference between TPUs and GPUs, I guess. Inflection, a new LLM company, created a 20,000-H100 cluster, and I'm positive OpenAI, Tesla, Meta, etc. have orchestrated jobs on more than 2,500 H100 GPUs.
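[Editor's note] A hedged back-of-the-envelope behind those numbers; the peak figures below are commonly quoted vendor spec-sheet values, not taken from the article. TPU v5e is rated around 197 bf16 TFLOPS per chip, and an H100 SXM around 3,958 FP8 TFLOPS with sparsity, which is roughly where a ~2,500-GPU equivalence comes from.

```python
# Back-of-the-envelope check of the comparison in the comment above.
# Peak-throughput figures are vendor spec-sheet numbers, not measurements.
tpu_v5e_peak_bf16 = 197e12     # per-chip bf16 peak, TPU v5e
h100_peak_fp8 = 3958e12        # H100 SXM FP8 peak (with sparsity)
h100_peak_bf16 = 989e12        # H100 SXM bf16 peak (dense)

cluster_flops = 50_944 * tpu_v5e_peak_bf16          # ~1.0e19 = ~10 exaFLOPS
print(f"{cluster_flops / 1e18:.1f} EFLOPS")
print(f"~{cluster_flops / h100_peak_fp8:,.0f} H100s at FP8 (sparse) peak")
print(f"~{cluster_flops / h100_peak_bf16:,.0f} H100s at bf16 (dense) peak")
# The equivalence swings from ~2,500 to ~10,000 GPUs depending on which
# H100 peak number you pick, so the comparison is fairly sensitive.
```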
jeffbee, over 1 year ago
Something that doesn't seem worth bragging about is that the startup time increases linearly with the cluster size. Wouldn't you want it to be constant? What's the issue there?
behnamoh, over 1 year ago
Can someone ELI5 this?