TBH I kinda agree with the argument that distributed training is too hard. It's so dependent on your architecture, compute resources, and network topology that once people open that can of worms, they quickly realize the cost/benefit tradeoff is limited unless you're doing large-scale pre-training. It's just so much easier to train as much as possible on a single node.
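
To illustrate what I mean by "easier": here's a minimal single-node sketch (toy model/data, purely hypothetical, not anyone's real setup). One process, a plain loop, and if the box happens to have multiple GPUs you can wrap the model in `nn.DataParallel`. No launcher, no rendezvous config, no thinking about interconnects.

```python
# Single-node training sketch -- toy stand-ins, swap in your real model/data.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

device = "cuda" if torch.cuda.is_available() else "cpu"

# Hypothetical toy dataset: 1024 samples, 32 features, binary labels.
data = TensorDataset(torch.randn(1024, 32), torch.randint(0, 2, (1024,)))
loader = DataLoader(data, batch_size=64, shuffle=True)

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 2)).to(device)
if torch.cuda.device_count() > 1:
    # Still one process, one node: batches just get split across local GPUs.
    model = nn.DataParallel(model)

opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(3):
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
    print(f"epoch {epoch}: loss {loss.item():.4f}")
```

Compare that to multi-node: suddenly you're picking a backend, setting up rendezvous, tuning bucket sizes around your network, and debugging hangs that only show up on your cluster's topology. That's the can of worms.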