I don't find this very convincing, from either a mathematical or an experimental standpoint.

It seems their method is equivalent to SGD where the learning rate of each tensor is scaled by the number of elements in the tensor. The supposed "signal-to-noise ratio" they use is just gSNR = norm(g) / RMS(g - mean(g)), where g is the gradient w.r.t. a d-dimensional tensor and the mean is computed across the elements of g. For an iid zero-mean random gradient, the elementwise mean(g) ≈ 0, and a similar argument probably holds for arbitrary (not completely random) high-dimensional gradients. In that case RMS(g - mean(g)) ≈ RMS(g) = norm(g)/sqrt(d), so gSNR ≈ sqrt(d), which explains why it is constant over time and how it varies across the components of the network.

It also seems the optimal value in their hyperparameter sweeps lands at the edge of the sweep range in almost every case, and a granularity of 10x for the learning rate and weight decay is too coarse to make direct comparisons anyway.
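A minimal sketch of that argument, assuming gSNR is computed elementwise over a tensor's gradient as described above (the exact reduction in the paper may differ):

```python
import numpy as np

def gsnr(g):
    """gSNR as defined above: ||g|| / RMS(g - mean(g))."""
    centered = g - g.mean()
    return np.linalg.norm(g) / np.sqrt(np.mean(centered ** 2))

rng = np.random.default_rng(0)
for d in [10_000, 100_000, 1_000_000]:
    g = rng.normal(size=d)                 # iid zero-mean "gradient"
    print(d, gsnr(g), np.sqrt(d))          # gSNR closely tracks sqrt(d)
```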
Tried the source code on a toy model: Adam took 2 epochs to train a 10k-parameter model; this optimizer didn't achieve anything useful in 20.

Tweaked the hyperparameters a bit and such, but nothing. Probably a bogus implementation?
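For reference, a toy-model comparison along these lines might look like the sketch below; `PaperOptimizer` is a hypothetical placeholder for whatever the released code exposes, and the baseline is plain torch.optim.Adam:

```python
import torch
import torch.nn as nn

def train(optimizer_cls, epochs, **opt_kwargs):
    torch.manual_seed(0)
    # Small MLP, roughly 10k parameters, on a synthetic regression task.
    model = nn.Sequential(nn.Linear(32, 160), nn.ReLU(), nn.Linear(160, 32))
    opt = optimizer_cls(model.parameters(), **opt_kwargs)
    x = torch.randn(2048, 32)
    y = torch.sin(x)                        # arbitrary smooth target
    for _ in range(epochs):
        for i in range(0, len(x), 256):
            xb, yb = x[i:i + 256], y[i:i + 256]
            loss = nn.functional.mse_loss(model(xb), yb)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return loss.item()

print("adam :", train(torch.optim.Adam, epochs=2, lr=1e-3))
# print("paper:", train(PaperOptimizer, epochs=20, lr=...))  # swap in the released optimizer
```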
Interesting take, but:

After a reread, it's nice to see the optimizer is faster, but how much time is actually spent in the optimizer? And can AdamW be tuned for a low-memory environment, given it's greedy in trying to reduce the impact of statistical noise on gradient calculations?

Note that when training on ImageNet-1k it only becomes comparable to AdamW after many epochs, and in fact performs measurably worse for most of the training session.
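One rough way to answer the "how long is spent in the optimizer" question is to time opt.step() separately from the forward/backward pass. This is a CPU wall-clock sketch with a generic PyTorch loop, not the paper's setup; on GPU you would need to synchronize before reading the clock:

```python
import time
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)

fwd_bwd, step = 0.0, 0.0
for _ in range(100):
    x, y = torch.randn(256, 512), torch.randint(0, 10, (256,))
    t0 = time.perf_counter()
    loss = nn.functional.cross_entropy(model(x), y)
    opt.zero_grad()
    loss.backward()
    t1 = time.perf_counter()
    opt.step()                              # time only the optimizer update
    t2 = time.perf_counter()
    fwd_bwd += t1 - t0
    step += t2 - t1

print(f"forward+backward: {fwd_bwd:.3f}s, optimizer.step: {step:.3f}s")
```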
(How significant that is, is up for debate and depends on the model/task/data.)

Why not incorporate 2nd-order changes into AdamW directly?

The lower memory footprint is nice, but it's not immediately clear why this is the case. Is the batch size reduced? Model changed? I'll reread this after a 2nd coffee and see if it is more obvious...

Still promising if true.
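On the memory question: one likely source of the difference is optimizer state rather than batch size or model changes, since AdamW keeps two extra float buffers per parameter (exp_avg and exp_avg_sq). A quick way to see that overhead in PyTorch (a sketch, nothing from the paper):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 1024))
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

# State buffers are created lazily, so run one dummy step first.
loss = model(torch.randn(8, 1024)).sum()
loss.backward()
opt.step()

param_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
state_bytes = sum(
    t.numel() * t.element_size()
    for state in opt.state.values()
    for t in state.values()
    if torch.is_tensor(t)
)
print(f"parameters: {param_bytes / 1e6:.1f} MB, AdamW state: {state_bytes / 1e6:.1f} MB")
# exp_avg and exp_avg_sq roughly double the parameter memory, so an optimizer
# that drops one or both buffers saves most of this.
```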