ZeRO is mainly a clever improvement that moves the optimizer computation in between the two phases of Ring-AllReduce (reduce-scatter and all-gather). It greatly reduces per-GPU memory overhead for Adam and similar optimizers.<p>The naive approach, as used in the well-known Megatron, completes Ring-AllReduce first so that each GPU has the full set of aggregated gradients. Then every GPU does the same optimizer computation for all parameters. That's OK for vanilla SGD because vanilla SGD has no optimizer state. But for Adam, the naive approach has to store a full copy of Adam's m/v state on each GPU, which is super memory consuming.<p>Actually, after the 1st phase of Ring-AllReduce (the reduce-scatter), each GPU already holds the fully aggregated gradients for its own subset of parameters. Each GPU can run the Adam update for just that subset, and importantly, it only needs to keep the m/v corresponding to that subset. After the Adam update completes, the 2nd phase of Ring-AllReduce (the all-gather) distributes the updated parameters to all GPUs. With N GPUs, each one stores and updates only about 1/N of the optimizer state, so that saves both memory and computation.<p>(The naive approach is more general as it allows the optimizer to use state from different network layers. But most optimizers, like Adam, are elementwise and don't really need that capability. ZeRO cleverly leverages that locality.)
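<p>To make the comparison concrete, here is a toy single-process sketch in plain Python. It is not DeepSpeed's or Megatron's code, and there is no real communication: the "ranks" are just list indices, and reduce-scatter / all-gather are simulated with slicing and concatenation. The function names (`adam_step`, `naive_step`, `sharded_step`), the hyperparameters, the gradient averaging, and the assumption that the parameter count divides evenly by the number of ranks are all made up for illustration.

```python
import math
import random


def adam_step(params, grads, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """Elementwise Adam on plain lists. Updates m and v in place, returns new params."""
    out = []
    for i, (p, g) in enumerate(zip(params, grads)):
        m[i] = b1 * m[i] + (1 - b1) * g
        v[i] = b2 * v[i] + (1 - b2) * g * g
        m_hat = m[i] / (1 - b1 ** t)
        v_hat = v[i] / (1 - b2 ** t)
        out.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
    return out


def naive_step(params, local_grads, m, v, t):
    """Full all-reduce, then Adam over ALL params (each rank would hold full m/v)."""
    n = len(local_grads)
    avg = [sum(g[i] for g in local_grads) / n for i in range(len(params))]
    return adam_step(params, avg, m, v, t)


def sharded_step(params, local_grads, m_shards, v_shards, t):
    """Reduce-scatter, Adam on each rank's slice only, then all-gather."""
    n = len(local_grads)
    shard = len(params) // n  # assumption: len(params) is divisible by n
    new_params = []
    for rank in range(n):
        lo, hi = rank * shard, (rank + 1) * shard
        # Simulated reduce-scatter: this rank ends up with averaged grads for its slice.
        g_shard = [sum(g[i] for g in local_grads) / n for i in range(lo, hi)]
        # Local Adam update, touching only this rank's slice of m/v.
        updated = adam_step(params[lo:hi], g_shard, m_shards[rank], v_shards[rank], t)
        # Simulated all-gather: concatenate the updated slices in rank order.
        new_params.extend(updated)
    return new_params


random.seed(0)
N_RANKS, N_PARAMS = 4, 8
params0 = [random.uniform(-1, 1) for _ in range(N_PARAMS)]

naive_params, sharded_params = list(params0), list(params0)
m_full, v_full = [0.0] * N_PARAMS, [0.0] * N_PARAMS
m_shards = [[0.0] * (N_PARAMS // N_RANKS) for _ in range(N_RANKS)]
v_shards = [[0.0] * (N_PARAMS // N_RANKS) for _ in range(N_RANKS)]

for t in range(1, 4):
    # Each rank computes its own gradient for every parameter (random stand-ins here).
    local_grads = [[random.gauss(0, 1) for _ in range(N_PARAMS)] for _ in range(N_RANKS)]
    naive_params = naive_step(naive_params, local_grads, m_full, v_full, t)
    sharded_params = sharded_step(sharded_params, local_grads, m_shards, v_shards, t)

max_diff = max(abs(a - b) for a, b in zip(naive_params, sharded_params))
print("max |naive - sharded| =", max_diff)
print("m/v entries per rank: naive", len(m_full), "vs sharded", len(m_shards[0]))
```

Both paths should land on the same parameters, because Adam's update for one element only reads that element's gradient and m/v. The only difference is how much optimizer state each rank has to keep around.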