Full title: "Google Cloud demonstrates the world's largest distributed training job for large language models across 50,000+ TPU v5e chips"

Summary from Bard: "This article is about training large language models (LLMs) on Google Cloud TPUs. It discusses the challenges of training LLMs at scale, and how Google Cloud TPU Multislice Training addresses these challenges. The article also details the results of a recent experiment in which Google trained a 128B parameter LLM on 50,944 TPU v5e chips. This experiment is the largest publicly disclosed LLM distributed training job to date."
Question for rwitten or anyone else involved in this project:

I see a per-device batch size of 6 for the 16B model. With 256 × 199 = 50,944 TPUs and a sequence length of 2048, that works out to 6 × 50,944 × 2,048 ≈ 626M tokens per batch. This is much larger than is typical for training runs of dense LMs of this size, which are usually closer to ~4M tokens per batch.

Was your critical batch size really this large? In other words, did you really see a benefit compared to a much smaller batch size (and probably many fewer TPUs)? Did you use some special learning rate schedule or optimizer to achieve this?
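For reference, a minimal sketch of the arithmetic behind that estimate, using only the figures quoted above (and assuming "per-device batch size" means sequences per chip, as it does in MaxText):

    # Back-of-the-envelope global batch size for the 16B config.
    chips_per_slice = 256
    num_slices = 199
    num_chips = chips_per_slice * num_slices      # 50,944 TPU v5e chips

    per_device_batch = 6                          # sequences per chip
    seq_len = 2048                                # tokens per sequence

    tokens_per_batch = per_device_batch * num_chips * seq_len
    print(f"{num_chips} chips -> {tokens_per_batch / 1e6:.0f}M tokens per batch")
    # ~626M tokens per batch, vs. the ~4M typical for dense LMs of this size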
Ok, so they claim in the article that 50,000 TPUs is equivalent to 10 exaFLOPS of floating point compute. That is equivalent to only ~2,512 NVIDIA H100s, which is really small. Just shows the difference between TPUs and GPUs, I guess. Inflection, a new LLM company, built a 20,000 H100 cluster, and I'm positive OpenAI, Tesla, Meta, etc. have orchestrated jobs on more than 2,500 H100 GPUs.
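A rough sketch of how sensitive that comparison is to which peak figure you pick (the per-chip numbers below are published datasheet peaks, not anything from the article; the choice of precision is an assumption and changes the answer by roughly 4x):

    # Aggregate peak for the TPU v5e cluster, in bf16.
    tpu_v5e_bf16 = 197e12            # peak bf16 FLOPS per TPU v5e chip
    cluster_flops = 50_944 * tpu_v5e_bf16
    print(f"cluster peak: {cluster_flops / 1e18:.1f} exaFLOPS")   # ~10.0

    # Equivalent H100 counts depend heavily on which H100 peak you compare against.
    h100_bf16_dense = 989e12         # H100 SXM, bf16, dense
    h100_fp8_sparse = 3958e12        # H100 SXM, fp8, with structured sparsity
    print(f"vs bf16 dense: {cluster_flops / h100_bf16_dense:,.0f} H100s")   # ~10,100
    print(f"vs fp8 sparse: {cluster_flops / h100_fp8_sparse:,.0f} H100s")   # ~2,500

In other words, the cluster looks like ~2,500 H100s only if you compare bf16 TPU peak against the fp8-with-sparsity H100 peak; against bf16 dense it's closer to ~10,000 H100s.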
Something that doesn't seem worth bragging about is that the startup time increases linearly with the cluster size. Wouldn't you want it to be constant? What's the issue there?