This is pretty cool. It sounds like they have found a way to pair the random initialization scheme with a per-layer learning-rate scaling so that the optimal hyperparameters become roughly invariant to the width of the network. Then they can do hyperparameter optimization on a much smaller version of the network and trust that those hyperparameters will also be near-optimal on the full-size network.
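Roughly, I'd picture it like the sketch below (my own rough reading, not the paper's exact parametrization): init variance shrinks with fan-in, and the hidden layers' Adam learning rate shrinks with the width multiplier relative to a small "proxy" width where you did the tuning. The names (`make_mlp_and_optimizer`, `base_width`, `mult`) and the specific scaling rules are my assumptions for illustration.

```python
# Hypothetical sketch of width-aware init + per-layer LR scaling.
# The exact exponents here are my guess at the idea, not the paper's rules.
import torch
import torch.nn as nn

def make_mlp_and_optimizer(width, base_width=256, base_lr=1e-3,
                           d_in=784, d_out=10):
    mult = width / base_width  # width multiplier vs. the tuned proxy model

    fc_in = nn.Linear(d_in, width)
    fc_hidden = nn.Linear(width, width)
    fc_out = nn.Linear(width, d_out)

    # Init: std ~ 1/sqrt(fan_in), so activations stay O(1) at any width.
    for layer in (fc_in, fc_hidden, fc_out):
        nn.init.normal_(layer.weight, std=layer.in_features ** -0.5)
        nn.init.zeros_(layer.bias)

    model = nn.Sequential(fc_in, nn.ReLU(), fc_hidden, nn.ReLU(), fc_out)

    # Per-layer LRs: layers whose fan-in grows with width get their LR
    # shrunk by the width multiplier; the input layer keeps the base LR.
    optimizer = torch.optim.Adam([
        {"params": fc_in.parameters(), "lr": base_lr},
        {"params": fc_hidden.parameters(), "lr": base_lr / mult},
        {"params": fc_out.parameters(), "lr": base_lr / mult},
    ])
    return model, optimizer

# Tune base_lr cheaply at width=256, then reuse it at full size and let
# the scaling rules adapt it per layer:
model, opt = make_mlp_and_optimizer(width=4096)
```

If something like this holds, the sweep over learning rates only ever has to run on the width-256 proxy, which is where the compute savings come from.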