I work in deep learning for 3D imaging, and memory has consistently been the primary bottleneck for our group. U-Net, for example, tends to be fairly "chonky" and isn't great in terms of parameter efficiency (though it is nice when you need an out-of-the-box network that just "works"...). This has pushed medical imaging toward a lot of "patching" and other sliding-window techniques to get around that burden.<p>I tend to think much of this is because Facebook/Google/etc. are more interested in 2D images and haven't put a ton of effort into 3D approaches, which are far harder in terms of parameter counts. I can't say whether parallelism is the answer (vs. single GPUs with massive memory vs. more efficient network design vs. data compression techniques), but I think this is where a lot of the bleeding edge will come from.
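For anyone unfamiliar with the patching idea, here's a minimal sketch of sliding-window patch extraction over a 3D volume (plain NumPy; the patch and stride sizes are made-up illustrative values, and a real pipeline would also stitch the per-patch predictions back together with overlap blending):

    import numpy as np

    def extract_patches_3d(volume, patch=(64, 64, 64), stride=(32, 32, 32)):
        # Slide a 3D window over the volume and collect overlapping patches,
        # keeping the corner coordinates so predictions can be stitched back.
        D, H, W = volume.shape
        pd, ph, pw = patch
        sd, sh, sw = stride
        patches = []
        for z in range(0, max(D - pd, 0) + 1, sd):
            for y in range(0, max(H - ph, 0) + 1, sh):
                for x in range(0, max(W - pw, 0) + 1, sw):
                    patches.append(((z, y, x), volume[z:z+pd, y:y+ph, x:x+pw]))
        return patches

    # A 128^3 volume with 64^3 patches and stride 32 yields 3*3*3 = 27 patches.
    vol = np.zeros((128, 128, 128), dtype=np.float32)
    print(len(extract_patches_3d(vol)))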
Even from the paper, it's hard to tell what this library actually does: Section 5 in <a href="https://arxiv.org/pdf/1910.02054.pdf" rel="nofollow">https://arxiv.org/pdf/1910.02054.pdf</a><p>The paper talks about parameter partitioning and overlapped communication, but doesn't actually give many details on how those things happen.<p>The library appears to be an implementation of some common algorithms for solving the 'pebble game,' as explained decently here: <a href="https://medium.com/tensorflow/fitting-larger-networks-into-memory-583e3c758ff9" rel="nofollow">https://medium.com/tensorflow/fitting-larger-networks-into-m...</a><p>The essential points are:<p>(1) model parallelism is hard to do and has historically been done manually to scale <i>wide</i> models across GPUs<p>(2) inter-GPU I/O is expensive for vanilla data-parallel jobs (which typically use naive mirroring strategies)<p>(3) researchers have now figured out how to 'compile' a <i>deep</i> model so that layers span GPUs, saving on both memory usage and I/O<p>(4) so scaling <i>wide</i> models is still hard, but now we have better tools for <i>deep</i> models<p>Existing all-reduce-based data-parallel setups have already been well studied (see e.g. <a href="https://people.eecs.berkeley.edu/~jfc/papers/14/Kylix.pdf" rel="nofollow">https://people.eecs.berkeley.edu/~jfc/papers/14/Kylix.pdf</a> ), so it's really nice to see gains through new techniques.<p>Definitely like seeing this 'compilation' being wrapped up into a library. Just wish they did a better job of communicating the key ideas.
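That Medium post is essentially about gradient checkpointing: trading recomputation for activation memory. For concreteness, here's a minimal PyTorch sketch of that trade-off using the generic torch.utils.checkpoint utility (this is not DeepSpeed's API, and the layer count/sizes are made up):

    import torch
    import torch.nn as nn
    from torch.utils.checkpoint import checkpoint_sequential

    # A deliberately deep stack of blocks; without checkpointing, every
    # intermediate activation is kept alive for the backward pass.
    model = nn.Sequential(*[nn.Sequential(nn.Linear(1024, 1024), nn.ReLU())
                            for _ in range(32)])

    x = torch.randn(16, 1024, requires_grad=True)

    # Split the 32 blocks into 4 segments: only activations at segment
    # boundaries are stored (the "pebbles"); the rest are recomputed on backward.
    y = checkpoint_sequential(model, 4, x)
    y.sum().backward()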
Looks like what it does is similar to what Alex did a few years back with the One Weird Trick paper: <a href="https://arxiv.org/abs/1404.5997" rel="nofollow">https://arxiv.org/abs/1404.5997</a><p>When training transformers, I notice a lot more time spent on allreduce than with CNN models, probably due to the parameter sizes. OWT seems natural to exploit in this situation (lots of GEMMs, lots of time spent on allreduce).<p>Edit:<p>Read the paper. The implementation is much less tricky than OWT, probably for a good reason: a language model's GEMMs are smaller, so partitioning the model would hurt efficiency (smaller GEMMs run slower). This does require much better interconnects, which NVLink / InfiniBand conveniently provide but which aren't available on consumer-grade hardware (2-way NVLink isn't meaningful).
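As a rough illustration of the "smaller GEMMs are slower" point, here's a toy benchmark (made-up sizes; the gap is far more pronounced on a GPU than on CPU) comparing one large matmul with the same work chopped into slices, as if the layer were partitioned across 8 workers and communication were free:

    import time
    import torch

    def bench(fn, warmup=3, iters=10):
        for _ in range(warmup):
            fn()
        start = time.perf_counter()
        for _ in range(iters):
            fn()
        return (time.perf_counter() - start) / iters

    n = 2048
    a, b = torch.randn(n, n), torch.randn(n, n)

    # One big GEMM vs. the same work split into 8 column slices.
    big = lambda: a @ b
    small = lambda: [a @ b[:, i * (n // 8):(i + 1) * (n // 8)] for i in range(8)]

    print(f"one {n}x{n} GEMM: {bench(big) * 1e3:.2f} ms")
    print(f"8 sliced GEMMs:  {bench(small) * 1e3:.2f} ms")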
ZeRO is mainly a clever improvement that moves the optimizer computation into the two phases of Ring-AllReduce (reduce-scatter and all-gather). It greatly helps Adam and similar optimizers reduce per-GPU memory overhead.<p>The naive approach, as used in the well-known Megatron, completes the Ring-AllReduce first so that each GPU has a full set of aggregated gradients, and then runs the same optimizer computation for all parameters on every GPU. That's fine for vanilla SGD, which has no optimizer state, but for Adam the naive approach has to store a full copy of the m/v state on each GPU, which is extremely memory-hungry. In fact, after the first phase (reduce-scatter), each GPU already holds the fully reduced gradients for its own subset of parameters. Each GPU can run the Adam update for just that subset and, importantly, only needs to keep the m/v state corresponding to it. Once the update completes, the second phase (all-gather) distributes the updated parameters to all GPUs. The result is both memory and computation savings. (The naive approach is more general, since it allows an optimizer to mix state across different network layers, but most optimizers, Adam included, don't need that capability, and ZeRO cleverly exploits that locality.)
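Here's a single-process toy simulation of that sharded-optimizer idea (plain NumPy standing in for the reduce-scatter / all-gather collectives; this is my own sketch, not DeepSpeed's code, and Adam's bias correction is omitted for brevity):

    import numpy as np

    np.random.seed(0)
    world_size, n_params = 4, 16
    lr, beta1, beta2, eps = 1e-3, 0.9, 0.999, 1e-8

    params = np.random.randn(n_params).astype(np.float32)
    # Per-rank local gradients (in real training, from each rank's minibatch).
    local_grads = [np.random.randn(n_params).astype(np.float32)
                   for _ in range(world_size)]

    shard = n_params // world_size
    # Each rank keeps Adam m/v only for its own shard: 1/world_size of the state.
    m = [np.zeros(shard, np.float32) for _ in range(world_size)]
    v = [np.zeros(shard, np.float32) for _ in range(world_size)]

    # Phase 1 (reduce-scatter): rank r ends up with the summed gradients
    # for shard r only.
    summed = np.sum(local_grads, axis=0)
    grad_shards = [summed[r * shard:(r + 1) * shard] for r in range(world_size)]

    # Sharded Adam step: each rank updates only its own slice of the parameters,
    # touching only its own slice of m/v.
    new_shards = []
    for r in range(world_size):
        g = grad_shards[r]
        m[r] = beta1 * m[r] + (1 - beta1) * g
        v[r] = beta2 * v[r] + (1 - beta2) * g * g
        p = params[r * shard:(r + 1) * shard] - lr * m[r] / (np.sqrt(v[r]) + eps)
        new_shards.append(p)

    # Phase 2 (all-gather): every rank reassembles the full updated parameters.
    params = np.concatenate(new_shards)
    print(params.shape)  # (16,)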
Link to GitHub: <a href="https://github.com/microsoft/DeepSpeed" rel="nofollow">https://github.com/microsoft/DeepSpeed</a>
I wrote a blog post on the difficulties of memory-efficient training, which seems relevant: <a href="http://mitchgordon.me/machine/learning/2020/01/13/do-we-really-need-model-compression.html" rel="nofollow">http://mitchgordon.me/machine/learning/2020/01/13/do-we-real...</a><p>The methods discussed there take a different angle on the problem.
While this looks interesting on the surface, can anyone help me understand who exactly needs to run (and re-run) neural network training at a scale that would take advantage of these optimizations? I'm struggling to understand which companies/data scientists would use this.