6 comments

gwern, over 1 year ago

Heh. This problem reminds me of back in 2019 when I was working with Shawn Presser on finetuning GPT-2 using Google Colab. There was a problem where it would randomly error out every once in a while, but it would also take like 10 minutes to redownload the last saved checkpoint from our server IIRC, and it would take minutes to save the current checkpoint. So the question was: how often should we save to minimize the time spent restoring and saving? I did a bit of algebra and I think we wound up with an answer like '40 minutes'!

DL infrastructure & training practices have gotten better since then...
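The comment does not show the algebra, so the following is only a sketch of one standard way to set up that trade-off: Young's first-order approximation, which assumes failures arrive at random with a known mean time between failures. The function names and the 2-minute save cost and 400-minute MTBF are hypothetical values chosen for illustration; only the roughly 10-minute restore time comes from the comment.

```python
import math


def overhead_fraction(interval_min, save_min, restore_min, mtbf_min):
    """Approximate fraction of wall-clock time lost per unit of training.

    Assumes (hypothetically) that failures are random with mean spacing
    mtbf_min and that a failure loses, on average, half an interval of work.
    """
    save_cost = save_min / interval_min                        # time spent writing checkpoints
    failure_cost = (restore_min + interval_min / 2) / mtbf_min  # restore + redone work
    return save_cost + failure_cost


def optimal_interval(save_min, mtbf_min):
    """Interval minimizing overhead_fraction (Young's approximation).

    Setting d/dT [save/T + T/(2*mtbf)] = 0 gives T = sqrt(2 * save * mtbf).
    The restore time shifts the overhead but not the optimum.
    """
    return math.sqrt(2 * save_min * mtbf_min)


if __name__ == "__main__":
    # Illustrative inputs only: 2 min to save, one failure every 400 min.
    best = optimal_interval(save_min=2, mtbf_min=400)
    print(best)  # 40.0 under these made-up inputs
    for interval in (20, best, 80):
        print(interval, overhead_fraction(interval, 2, 10, 400))
```

Under this model the restore time adds a constant to the overhead, so only the save cost and the failure rate determine how often to checkpoint.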

solidasparagus, over 1 year ago

Interesting work! This is really an engineering achievement, and I wish there were usable code. Real-time checkpointing seems like obviously the future to me, but it's going to take an easy-to-use, high-performance implementation to make that a reality.

One of the things I would like to have seen in the paper is a better analysis of simply checkpointing more often. It's briefly touched on:

> It is infeasible to arbitrarily increase the checkpoint frequency because checkpoint frequency is bottlenecked by the bandwidth of the remote persistent storage [28]. For example, it takes 42 minutes to checkpoint the model states of MT-NLG [68] to the remote persistent storage when the bandwidth is 20Gbps.

and

> Both baselines, Strawman and HighFreq, have the same checkpoint time and it stays almost the same as the number of machines increases from 4 to 16 because the aggregated bandwidth of the remote persistent storage is fixed

But that smells a bit off to me. That's a 530B model (unrealistically large given current trends IMO) where each model replica has 280 A100s, and then there is data parallelism on top. Where exactly are you storing your (sharded?) checkpoint such that the read/write bandwidth isn't also scaling horizontally beyond 20Gbps?
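As a rough sanity check on the quoted 42-minute figure, the arithmetic can be sketched as below. The 12 bytes of state per parameter (fp32 weights plus two fp32 Adam moments) is an assumption of this sketch, not something stated in the quoted passages, and `checkpoint_minutes` is a hypothetical helper name.

```python
def checkpoint_minutes(num_params, bytes_per_param, bandwidth_gbps):
    """Time to push a full checkpoint through a fixed-bandwidth link.

    bandwidth_gbps is in gigabits per second (1e9 bits/s), so the
    checkpoint size is converted from bytes to bits before dividing.
    """
    total_bits = num_params * bytes_per_param * 8
    seconds = total_bits / (bandwidth_gbps * 1e9)
    return seconds / 60


if __name__ == "__main__":
    # 530B parameters, assumed 12 bytes of state each, 20 Gbps aggregate link.
    print(checkpoint_minutes(530e9, 12, 20))
    # If storage bandwidth scaled with the cluster (hypothetical 16x),
    # the same checkpoint would take proportionally less time.
    print(checkpoint_minutes(530e9, 12, 20 * 16))
```

With that assumed state size the fixed 20 Gbps link gives about 42 minutes, which matches the paper's number, and the time falls linearly if the aggregate bandwidth scales out, which is the point the comment raises.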

rs42, over 1 year ago

It is also worth checking Microsoft Singularity (aka project forge): https://arxiv.org/abs/2202.07848 https://youtu.be/c4SUhWBybXo

pavelstoev, over 1 year ago

Very interesting. Well done, authors!

Mr_P, over 1 year ago

For anyone else who was confused to see a paper use the same name as a commercial product: it looks like Google Gemini was announced in May, whereas this was submitted to SOSP, which had an April submission deadline.

optimalsolver, over 1 year ago

Using an original name will probably make it easier for people to find your paper.

Maybe ask ChatGPT for ideas.

Gemini: Fast Failure Recovery in Distributed Training with In-Memory Checkpoints [pdf]
