Gemini: Fast Failure Recovery in Distributed Training with In-Memory Checkpoints [pdf]

50 points by mlerner, over 1 year ago

6 comments

gwern, over 1 year ago
Heh. This problem reminds me of back in 2019 when I was working with Shawn Presser on finetuning GPT-2 using Google Colab. It would randomly error out every once in a while, but it would also take something like 10 minutes to redownload the last saved checkpoint from our server, IIRC, and minutes more to save the current checkpoint. So the question was: how often should we save to minimize the time spent restoring and saving? I did a bit of algebra and I think we wound up with an answer like '40 minutes'!

DL infrastructure & training practices have gotten better since then...
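The trade-off described above can be sketched with Young's first-order approximation for the checkpoint interval. This is not necessarily the algebra gwern used; it is a minimal sketch under stated assumptions. The 10-minute restore cost comes from the comment, while the 4-minute save cost and the 200-minute mean time between failures are hypothetical values picked for illustration.

```python
import math


def young_interval(save_cost: float, mtbf: float) -> float:
    """Young's approximation: interval between checkpoints that minimizes
    expected lost time, given the cost of one save and the mean time
    between failures (all in the same time unit)."""
    return math.sqrt(2.0 * save_cost * mtbf)


def overhead_fraction(interval: float, save_cost: float,
                      restore_cost: float, mtbf: float) -> float:
    """First-order estimate of the fraction of wall-clock time lost.

    Two terms: time spent saving (one save per interval), plus the
    expected loss per failure (half an interval of redone work on
    average, plus the restore), amortized over the time between failures.
    """
    return save_cost / interval + (interval / 2.0 + restore_cost) / mtbf


# Assumed inputs, in minutes. Only restore_cost is taken from the comment.
save_cost = 4.0      # hypothetical: "minutes to save"
restore_cost = 10.0  # from the comment: "like 10 minutes" to redownload
mtbf = 200.0         # hypothetical failure rate

best = young_interval(save_cost, mtbf)
print(f"suggested interval: {best:.1f} min")
for candidate in (20.0, best, 80.0):
    lost = overhead_fraction(candidate, save_cost, restore_cost, mtbf)
    print(f"interval {candidate:5.1f} min -> {lost:.1%} of time lost")
```

In this first-order model the restore cost raises the overall loss but does not move the optimum; only the save cost and the failure rate do.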
solidasparagus, over 1 year ago
Interesting work! This is really an engineering achievement and I wish there was usable code. Real-time checkpointing seems like obviously the future to me, but it's going to be an easy-to-use, high-performance implementation that makes that a reality.

One of the things I would like to have seen in the paper is a better analysis of simply checkpointing more often. It's briefly touched on:

> It is infeasible to arbitrarily increase the checkpoint frequency because checkpoint frequency is bottlenecked by the bandwidth of the remote persistent storage [28]. For example, it takes 42 minutes to checkpoint the model states of MT-NLG [68] to the remote persistent storage when the bandwidth is 20Gbps.

and

> Both baselines, Strawman and HighFreq, have the same checkpoint time and it stays almost the same as the number of machines increases from 4 to 16 because the aggregated bandwidth of the remote persistent storage is fixed

But that smells a bit off to me. That's a 530B model (unrealistically large given current trends IMO) where each model replica has 280 A100s and then there is data parallelism on top. Where exactly are you storing your (sharded?) checkpoint where the read/write bandwidth isn't also scaling horizontally beyond 20Gbps?
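The quoted figures can be checked with some quick arithmetic. The sketch below assumes decimal units (1 Gbps = 10^9 bits/s) and a fully utilized link, neither of which the quote states, and uses the 530B parameter count from the comment. The 16x bandwidth case is a hypothetical illustration of the horizontal-scaling point, not a number from the paper.

```python
def implied_checkpoint_bytes(minutes: float, gbps: float) -> float:
    """Bytes moved in `minutes` over a link of `gbps` gigabits per second,
    assuming decimal units and full utilization."""
    return minutes * 60 * gbps * 1e9 / 8


def checkpoint_minutes(size_bytes: float, gbps: float) -> float:
    """Minutes needed to write `size_bytes` at `gbps` gigabits per second."""
    return size_bytes * 8 / (gbps * 1e9) / 60


# Figures from the quoted passage: 42 minutes at 20 Gbps.
size = implied_checkpoint_bytes(42, 20)
print(f"implied checkpoint size: {size / 1e12:.1f} TB")

# 530B parameters, as stated in the comment.
bytes_per_param = size / 530e9
print(f"implied bytes per parameter: {bytes_per_param:.1f}")

# Hypothetical: aggregate storage bandwidth grows 16x instead of staying fixed.
print(f"at 320 Gbps: {checkpoint_minutes(size, 320):.1f} min")
```

Under these assumptions the checkpoint is about 6.3 TB, or roughly 12 bytes per parameter, and the write time falls in proportion to whatever aggregate bandwidth the storage can actually deliver.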
rs42, over 1 year ago
It is also worth checking out Microsoft Singularity (aka Project Forge): https://arxiv.org/abs/2202.07848 https://youtu.be/c4SUhWBybXo
pavelstoev, over 1 year ago
Very interesting. Well done authors!
Mr_P, over 1 year ago
For anyone else who was confused to see a paper use the same name as a commercial product, it looks like Google Gemini was announced in May, whereas this was submitted to SOSP, which had an April submission deadline.
optimalsolver, over 1 year ago
Using an original name will probably make it easier for people to find your paper.

Maybe ask ChatGPT for ideas.