Hugging Face has been working on integrating this into their library, and it has some pretty amazing effects on the size of models you can train on a simple Colab.

https://huggingface.co/blog/zero-deepspeed-fairscale
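If you want to try it, the integration hangs off the `deepspeed` argument of `TrainingArguments`. Here's a minimal sketch (the model name, toy dataset, and the `ds_config.json` filename/contents are just placeholders I made up for illustration; launch with the `deepspeed` launcher):

```python
# Minimal sketch of the Hugging Face Trainer + DeepSpeed ZeRO integration.
# Assumes transformers (with DeepSpeed support) and deepspeed are installed,
# and the script is launched via the deepspeed launcher, e.g. `deepspeed train.py`.
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

# Tiny toy dataset so the sketch is self-contained.
enc = tok(["good movie", "bad movie"], padding=True, return_tensors="pt")
train_dataset = [
    {"input_ids": enc["input_ids"][i],
     "attention_mask": enc["attention_mask"][i],
     "labels": torch.tensor(i % 2)}
    for i in range(2)
]

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=2,
    fp16=True,
    # Path to a DeepSpeed config that turns on ZeRO sharding/offload, e.g.
    # {"fp16": {"enabled": true}, "zero_optimization": {"stage": 2, "cpu_offload": true}}
    deepspeed="ds_config.json",
)

Trainer(model=model, args=args, train_dataset=train_dataset).train()
```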
Support for this was also added to [Fairscale](https://fairscale.readthedocs.io/en/latest/) and [Fairseq](https://github.com/pytorch/fairseq) last week. In particular, the Fairscale implementation can be used in any PyTorch project without requiring the DeepSpeed trainer.
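Roughly, the Fairscale version drops into an ordinary DDP-style training loop: `OSS` shards the optimizer state and `ShardedDataParallel` replaces the usual DDP wrapper. A minimal sketch (assumes fairscale is installed and one process per GPU; the toy model and hyperparameters are arbitrary):

```python
# Sketch of Fairscale's ZeRO-style optimizer state sharding in plain PyTorch,
# no DeepSpeed trainer required. Launch with one process per GPU, e.g.
# `python -m torch.distributed.launch --nproc_per_node=8 train.py`.
import torch
import torch.distributed as dist
from fairscale.optim.oss import OSS
from fairscale.nn.data_parallel import ShardedDataParallel as ShardedDDP

dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank())

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 1024)
).cuda()

# OSS partitions Adam's moment buffers across ranks instead of replicating them.
optimizer = OSS(model.parameters(), optim=torch.optim.Adam, lr=1e-4)
# ShardedDDP routes each gradient to the rank that owns its optimizer shard.
model = ShardedDDP(model, optimizer)

for _ in range(10):
    x = torch.randn(8, 1024, device="cuda")
    loss = model(x).pow(2).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```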
Question for someone knowledgeable about this: if I have a model that is large, but still small enough that a single training example fits on the GPU, does this approach offer speedups compared to simple gradient accumulation? Or is it only useful for models so large that the parameters themselves overwhelm GPU memory?
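For concreteness, by "simple gradient accumulation" I mean something like this plain PyTorch sketch (toy model and numbers), which trades wall-clock time for memory by splitting a big batch into micro-batches:

```python
# Gradient accumulation: run several small micro-batches, let gradients sum in
# .grad, and only then take one optimizer step. Activation memory scales with
# the micro-batch size, not the effective batch size.
import torch

model = torch.nn.Linear(512, 10).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
accum_steps = 8  # effective batch = micro-batch size * accum_steps

optimizer.zero_grad()
for step in range(80):
    x = torch.randn(4, 512, device="cuda")               # micro-batch of 4
    y = torch.randint(0, 10, (4,), device="cuda")
    loss = torch.nn.functional.cross_entropy(model(x), y)
    (loss / accum_steps).backward()                       # gradients accumulate
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```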
GPT-NeoX is an example of a project using DeepSpeed and ZeRO-3 offloading. The wider project intends to train a GPT-3-sized model and release it freely to the world.

https://github.com/EleutherAI/gpt-neox
This is also being added to PyTorch itself:

https://github.com/pytorch/pytorch/pull/46750
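If I'm reading it right, this lands as `torch.distributed.optim.ZeroRedundancyOptimizer` (essentially Fairscale's OSS upstreamed). In recent PyTorch versions the usage looks roughly like this sketch (assumes a distributed launch with one process per GPU; the toy model is arbitrary):

```python
# Sketch of PyTorch's built-in ZeRO-1-style optimizer state sharding.
# Launch with torchrun / torch.distributed, one process per GPU.
import torch
import torch.distributed as dist
from torch.distributed.optim import ZeroRedundancyOptimizer
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

model = DDP(torch.nn.Linear(2048, 2048).cuda(), device_ids=[rank])

# Adam's moment buffers are partitioned across ranks instead of replicated.
optimizer = ZeroRedundancyOptimizer(
    model.parameters(), optimizer_class=torch.optim.Adam, lr=1e-4
)

for _ in range(10):
    x = torch.randn(16, 2048, device="cuda")
    loss = model(x).pow(2).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```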
Simple 10 min overview/tutorial (official) if someone is interested - <a href="https://www.youtube.com/watch?v=ovQC7FqXHXk" rel="nofollow">https://www.youtube.com/watch?v=ovQC7FqXHXk</a>
See also zeroth-order backpropagation, which claims roughly 300x faster training without giving up much accuracy:

https://arxiv.org/abs/2011.08895
How much does ZeRO-3 affect accuracy?

See also https://github.com/microsoft/fastformers
Alternatively, one could get rid of the memory used by optimizers entirely by switching to vanilla SGD.

I haven’t tried this on transformers, and maybe that’s what breaks down here, but in “classic” supervised settings I’ve found SGD with schedule tuning just as fast as Adam.
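To put rough numbers on it: Adam keeps two extra fp32 buffers per parameter, so for N parameters the optimizer state alone is about 8N bytes, versus roughly 4N for SGD with momentum and essentially zero for plain SGD. A quick way to check this in PyTorch (small sketch on a toy linear layer):

```python
# Measure how much memory the optimizer state itself takes for
# Adam vs. SGD with momentum vs. plain SGD on the same model.
import torch

def optimizer_state_bytes(optimizer, model):
    # Run one dummy step so lazily-created state buffers (e.g. Adam's exp_avg)
    # actually exist, then sum the sizes of all state tensors.
    model(torch.randn(2, 1024)).sum().backward()
    optimizer.step()
    return sum(
        t.numel() * t.element_size()
        for state in optimizer.state.values()
        for t in state.values()
        if torch.is_tensor(t)
    )

for make_opt in (
    lambda p: torch.optim.Adam(p, lr=1e-3),                # ~8 bytes/param of state
    lambda p: torch.optim.SGD(p, lr=1e-3, momentum=0.9),   # ~4 bytes/param
    lambda p: torch.optim.SGD(p, lr=1e-3),                 # ~0 bytes of state
):
    model = torch.nn.Linear(1024, 1024)
    opt = make_opt(model.parameters())
    print(type(opt).__name__, optimizer_state_bytes(opt, model), "bytes")
```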