I don't find this very convincing, from either a mathematical or an experimental standpoint.

It seems their method is equivalent to SGD where the learning rate of each tensor is scaled by the number of elements in the tensor. The supposed "Signal-to-Noise ratio" they use is just gSNR = norm(g)/RMS(g - mean(g)), where g is the gradient w.r.t. a d-dimensional tensor and the mean is computed across the elements of g. For a zero-mean iid random gradient the elementwise mean(g) ≈ 0, and a similar argument probably holds for arbitrary (not completely random) high-dimensional gradients as well. In that case RMS(g - mean(g)) ≈ RMS(g) = norm(g)/sqrt(d), so gSNR ≈ sqrt(d), which explains why it is constant over time and how it varies across the components of the network. (A quick numerical check is sketched below.)

It also seems the optimum in their hyperparameter sweeps lands at the edge of the grid in almost every case, and a 10x granularity for the learning rate and weight decay sweeps is too coarse to make direct comparisons anyway.
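For what it's worth, here is a minimal numpy sanity check of the sqrt(d) claim, assuming gSNR is computed exactly as written above (norm of the gradient over the RMS of its mean-centered elements); the random Gaussian "gradients" are just a stand-in for zero-mean iid noise, not the paper's actual setup.

    import numpy as np

    def gsnr(g):
        # gSNR as defined above: norm(g) / RMS(g - mean(g));
        # np.std(g) is exactly RMS(g - mean(g))
        g = g.ravel()
        return np.linalg.norm(g) / np.std(g)

    rng = np.random.default_rng(0)
    for d in [10, 1_000, 100_000]:
        g = rng.normal(size=d)  # zero-mean iid "gradient" with d elements
        print(f"d={d}: gSNR={gsnr(g):.2f}, sqrt(d)={np.sqrt(d):.2f}")

The two columns match to within sampling noise, and they keep matching if you rescale g, which is consistent with gSNR being essentially a function of tensor size rather than of training dynamics.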