Distributed training infra/libs have made insane progress since the Megatron era.
I worked with the Megatron codebase to train larger-than-175B models a few years back. A lot of the boilerplate you find in those 20k LoC you could remove today just by importing DeepSpeed or another distributed training lib (see the sketch below).

Cerebras' point still stands though: even if you can get the LoC count down significantly nowadays, it's still a major PITA to debug those systems, deal with nodes crashing, tweak the architecture and the data-loading pipeline to keep GPU utilization high, work around network bottlenecks, etc.
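For a sense of how much of that boilerplate has turned into config, here's a minimal sketch of a DeepSpeed training step. It's not actual Megatron code; MyTransformer, dataloader, and the config values are made up for illustration:

    import deepspeed

    # Things that used to be hand-rolled boilerplate (ZeRO sharding, fp16,
    # optimizer setup) are now a config dict. Values here are illustrative.
    ds_config = {
        "train_batch_size": 256,
        "fp16": {"enabled": True},
        "zero_optimization": {"stage": 3},  # shard optimizer state, grads, params
        "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
    }

    model = MyTransformer()  # hypothetical model, stands in for a GPT-style net

    # Wraps the model in an engine that owns the optimizer, mixed precision
    # and ZeRO sharding across ranks.
    model_engine, optimizer, _, _ = deepspeed.initialize(
        model=model,
        model_parameters=model.parameters(),
        config=ds_config,
    )

    for batch in dataloader:  # hypothetical dataloader yielding token batches
        loss = model_engine(batch)
        model_engine.backward(loss)  # handles loss scaling and gradient sync
        model_engine.step()          # optimizer step + zero_grad

Of course this hides exactly the parts that still hurt: node failures, dataloader throughput, and network topology are outside the config dict.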
Scaling vertically first, like Cerebras is doing, surely makes that much easier.

On a tangentially related note, this is imho where OpenAI has built its moat: the training and inference stack they have refined over the last 6 years. They have good researchers, but so do MS, Google and Meta. No one else has the ability to train such large models with such ease. Same for the inference stack: being able to run GPT-3.5/4 in prod at the scale they are doing it is no joke, and I'm 100% convinced this is why Gemini is still not widely available a year after 3.5 came out.