Interesting work! This is really an engineering achievement and I wish there were usable code. Real-time checkpointing seems obviously like the future to me, but it's going to be an easy-to-use, high-performance implementation that makes that a reality.<p>One of the things I would like to have seen in the paper is a better analysis of simply checkpointing more often. It's briefly touched on:<p>> It is infeasible to arbitrarily increase the checkpoint frequency because checkpoint frequency is bottlenecked by the bandwidth of the remote persistent storage [28]. For example, it takes 42 minutes to checkpoint the model states of MT-NLG [68] to the remote persistent storage when the bandwidth is 20Gbps.<p>and<p>> Both baselines, Strawman and HighFreq, have the same checkpoint time and it stays almost the same as the number of machines increases from 4 to 16 because the aggregated bandwidth of the remote persistent storage is fixed<p>But that smells a bit off to me. MT-NLG is a 530B-parameter model (unrealistically large given current trends IMO) where each model replica uses 280 A100s, with data parallelism on top of that. Where exactly are you storing your (sharded?) checkpoint such that the read/write bandwidth doesn't also scale horizontally beyond 20Gbps?
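<p>To put rough numbers on that, here is a back-of-envelope sketch. The 42 minutes, 20Gbps, 530B parameters and 280 A100s come from the above; everything else is my assumption, not the paper's: 8 GPUs per node (so 35 nodes per replica), each node writing its own shard at 20Gbps, and storage that actually scales with the number of writers. The function names are made up for illustration.

```python
# Back-of-envelope only. NODES_PER_REPLICA and the "storage scales with
# writers" model are assumptions of mine, not claims from the paper.

PARAMS = 530e9            # MT-NLG parameter count
GPUS_PER_REPLICA = 280    # A100s per model replica
GPUS_PER_NODE = 8         # assumption: 8 GPUs per node
NODES_PER_REPLICA = GPUS_PER_REPLICA // GPUS_PER_NODE  # 35


def implied_checkpoint_bytes(minutes: float, gbps: float) -> float:
    """Checkpoint size implied by a transfer time at a given bandwidth."""
    bytes_per_second = gbps * 1e9 / 8  # gigabits/s to bytes/s
    return minutes * 60 * bytes_per_second


def checkpoint_minutes(size_bytes: float, gbps_per_writer: float, writers: int) -> float:
    """Transfer time if `writers` shards are written in parallel,
    assuming aggregate storage bandwidth grows linearly with writers."""
    aggregate_bytes_per_second = writers * gbps_per_writer * 1e9 / 8
    return size_bytes / aggregate_bytes_per_second / 60


size = implied_checkpoint_bytes(minutes=42, gbps=20)
bytes_per_param = size / PARAMS
sharded_minutes = checkpoint_minutes(size, gbps_per_writer=20, writers=NODES_PER_REPLICA)

print(f"implied checkpoint size: {size / 1e12:.1f} TB")
print(f"implied bytes per parameter: {bytes_per_param:.1f}")
print(f"time with {NODES_PER_REPLICA} parallel writers: {sharded_minutes:.1f} min")
```

So the quoted figure implies roughly 6.3 TB of state, or about 12 bytes per parameter, which is plausible for weights plus optimizer state. Under my (optimistic) linear-scaling assumption, writing one shard per node would take a bit over a minute instead of 42. Real storage won't scale perfectly, but the gap is large enough that a fixed 20Gbps aggregate seems like the thing that needs justifying.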