Hugging Face has been working on integrating this into their library, and it has some pretty amazing effects on the size of models you can train on a simple Colab.

https://huggingface.co/blog/zero-deepspeed-fairscale
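If you want to try it, the integration hangs off the `deepspeed` argument of `TrainingArguments`. Here's a minimal sketch (the model name, toy dataset, and the `ds_config.json` filename/contents are just placeholders I made up for illustration; launch with the `deepspeed` launcher):

```python
# Minimal sketch of the Hugging Face Trainer + DeepSpeed ZeRO integration.
# Assumes transformers (with DeepSpeed support) and deepspeed are installed,
# and the script is launched via the deepspeed launcher, e.g. `deepspeed train.py`.
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

# Tiny toy dataset so the sketch is self-contained.
enc = tok(["good movie", "bad movie"], padding=True, return_tensors="pt")
train_dataset = [
    {"input_ids": enc["input_ids"][i],
     "attention_mask": enc["attention_mask"][i],
     "labels": torch.tensor(i % 2)}
    for i in range(2)
]

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=2,
    fp16=True,
    # Path to a DeepSpeed config that turns on ZeRO sharding/offload, e.g.
    # {"fp16": {"enabled": true}, "zero_optimization": {"stage": 2, "cpu_offload": true}}
    deepspeed="ds_config.json",
)

Trainer(model=model, args=args, train_dataset=train_dataset).train()
```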
Support for this was also added to [Fairscale](https://fairscale.readthedocs.io/en/latest/) and [Fairseq](https://github.com/pytorch/fairseq) last week. In particular, the Fairscale implementation can be used in any PyTorch project without requiring the DeepSpeed trainer.
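Roughly, the Fairscale version drops into an ordinary DDP-style training loop: `OSS` shards the optimizer state and `ShardedDataParallel` replaces the usual DDP wrapper. A minimal sketch (assumes fairscale is installed and one process per GPU; the toy model and hyperparameters are arbitrary):

```python
# Sketch of Fairscale's ZeRO-style optimizer state sharding in plain PyTorch,
# no DeepSpeed trainer required. Launch with one process per GPU, e.g.
# `python -m torch.distributed.launch --nproc_per_node=8 train.py`.
import torch
import torch.distributed as dist
from fairscale.optim.oss import OSS
from fairscale.nn.data_parallel import ShardedDataParallel as ShardedDDP

dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank())

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 1024)
).cuda()

# OSS partitions Adam's moment buffers across ranks instead of replicating them.
optimizer = OSS(model.parameters(), optim=torch.optim.Adam, lr=1e-4)
# ShardedDDP routes each gradient to the rank that owns its optimizer shard.
model = ShardedDDP(model, optimizer)

for _ in range(10):
    x = torch.randn(8, 1024, device="cuda")
    loss = model(x).pow(2).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```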
Question for someone knowledgeable about this: if I have a model that is large, but still small enough that a single training example fits on the GPU, does this approach offer speedups compared to simple gradient accumulation? Or is it only useful for models so large that the parameters themselves overwhelm GPU memory?
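For concreteness, by "simple gradient accumulation" I mean something like this plain PyTorch sketch (toy model and numbers), which trades wall-clock time for memory by splitting a big batch into micro-batches:

```python
# Gradient accumulation: run several small micro-batches, let gradients sum in
# .grad, and only then take one optimizer step. Activation memory scales with
# the micro-batch size, not the effective batch size.
import torch

model = torch.nn.Linear(512, 10).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
accum_steps = 8  # effective batch = micro-batch size * accum_steps

optimizer.zero_grad()
for step in range(80):
    x = torch.randn(4, 512, device="cuda")               # micro-batch of 4
    y = torch.randint(0, 10, (4,), device="cuda")
    loss = torch.nn.functional.cross_entropy(model(x), y)
    (loss / accum_steps).backward()                       # gradients accumulate
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```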
GPT-NeoX is an example of a project using DeepSpeed and ZeRO-3 offloading. The wider project intends to train a GPT-3-sized model and release it freely to the world.

https://github.com/EleutherAI/gpt-neox
This is also being added to PyTorch itself:

https://github.com/pytorch/pytorch/pull/46750
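If I'm reading it right, this lands as `torch.distributed.optim.ZeroRedundancyOptimizer` (essentially Fairscale's OSS upstreamed). In recent PyTorch versions the usage looks roughly like this sketch (assumes a distributed launch with one process per GPU; the toy model is arbitrary):

```python
# Sketch of PyTorch's built-in ZeRO-1-style optimizer state sharding.
# Launch with torchrun / torch.distributed, one process per GPU.
import torch
import torch.distributed as dist
from torch.distributed.optim import ZeroRedundancyOptimizer
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

model = DDP(torch.nn.Linear(2048, 2048).cuda(), device_ids=[rank])

# Adam's moment buffers are partitioned across ranks instead of replicated.
optimizer = ZeroRedundancyOptimizer(
    model.parameters(), optimizer_class=torch.optim.Adam, lr=1e-4
)

for _ in range(10):
    x = torch.randn(16, 2048, device="cuda")
    loss = model(x).pow(2).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```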
Simple 10 min overview/tutorial (official) if someone is interested - <a href="https://www.youtube.com/watch?v=ovQC7FqXHXk" rel="nofollow">https://www.youtube.com/watch?v=ovQC7FqXHXk</a>
See also zeroth-order backpropagation, which claims roughly 300x faster training without giving up much accuracy:

https://arxiv.org/abs/2011.08895
How much does ZeRO-3 affect accuracy?

See also https://github.com/microsoft/fastformers
Alternatively, one could get rid of the memory used by optimizers entirely by switching to vanilla SGD.

I haven’t tried this on transformers, and maybe that’s what breaks down here, but in “classic” supervised settings I’ve found SGD with schedule tuning just as fast as Adam.
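To put rough numbers on it: Adam keeps two extra fp32 buffers per parameter, so for N parameters the optimizer state alone is about 8N bytes, versus roughly 4N for SGD with momentum and essentially zero for plain SGD. A quick way to check this in PyTorch (small sketch on a toy linear layer):

```python
# Measure how much memory the optimizer state itself takes for
# Adam vs. SGD with momentum vs. plain SGD on the same model.
import torch

def optimizer_state_bytes(optimizer, model):
    # Run one dummy step so lazily-created state buffers (e.g. Adam's exp_avg)
    # actually exist, then sum the sizes of all state tensors.
    model(torch.randn(2, 1024)).sum().backward()
    optimizer.step()
    return sum(
        t.numel() * t.element_size()
        for state in optimizer.state.values()
        for t in state.values()
        if torch.is_tensor(t)
    )

for make_opt in (
    lambda p: torch.optim.Adam(p, lr=1e-3),                # ~8 bytes/param of state
    lambda p: torch.optim.SGD(p, lr=1e-3, momentum=0.9),   # ~4 bytes/param
    lambda p: torch.optim.SGD(p, lr=1e-3),                 # ~0 bytes of state
):
    model = torch.nn.Linear(1024, 1024)
    opt = make_opt(model.parameters())
    print(type(opt).__name__, optimizer_state_bytes(opt, model), "bytes")
```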