No More Adam: Learning Rate Scaling at Initialization Is All You Need

91 points by jinqueeny 5 months ago

5 comments

akos23 5 months ago
I don't find this very convincing, from both a mathematical and an experimental standpoint.

It seems their method is equivalent to SGD where the learning rate of each tensor is scaled by the number of elements in the tensor. The supposed "Signal-to-Noise ratio" they use is just gSNR = norm(g) / RMS(g - mean(g)), where g is the gradient w.r.t. a d-dimensional tensor and the mean is computed across the elements of g. For a zero-mean iid random gradient, the elementwise mean(g) ≈ 0, and a similar argument probably holds for arbitrary, but not completely random, high-dimensional gradients. In that case RMS(g - mean(g)) ≈ RMS(g) = norm(g)/sqrt(d), so gSNR ≈ sqrt(d), which explains why it is constant over time and how it varies across the components of the network.

It also seems the optimal value of their hyperparameter sweeps occurs at the edge in almost every case, and a granularity of 10x for the learning rate and weight decay is too large to make direct comparisons anyway.
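A minimal numpy sketch of that argument, assuming iid Gaussian stand-in gradients rather than real training gradients, and using the gSNR definition exactly as stated above:

```python
# Sketch of the gSNR argument above: for zero-mean iid "gradients",
# gSNR = norm(g) / RMS(g - mean(g)) comes out close to sqrt(d), independent
# of the gradient's scale. The Gaussian gradients here are an assumption,
# not the paper's actual training gradients.
import numpy as np

def gsnr(g):
    """gSNR as defined in the comment: norm(g) / RMS(g - mean(g))."""
    centered = g - g.mean()
    rms = np.sqrt(np.mean(centered ** 2))
    return np.linalg.norm(g) / rms

rng = np.random.default_rng(0)
for d in (100, 10_000, 1_000_000):
    g = rng.normal(scale=3.0, size=d)   # arbitrary scale; gSNR is scale-free
    print(f"d={d:>9}  gSNR={gsnr(g):10.2f}  sqrt(d)={np.sqrt(d):10.2f}")
```

The printed values track sqrt(d) closely and do not depend on the gradient's scale, which is the constancy the comment points out.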
eden-u4 5 months ago
Tried the source code on a toy model: Adam took 2 epochs to train a 10k-parameter model; this didn't achieve anything useful in 20.

Tweaked the hyperparameters a bit and such, but nothing. Probably a bogus implementation?
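For reference, a minimal PyTorch harness in the spirit of that kind of toy-model comparison; the synthetic regression task, the small MLP, and the use of plain momentum SGD as the second optimizer are assumptions here, since the paper's own optimizer is not reproduced:

```python
# Toy comparison harness: trains the same small MLP with two optimizers on a
# synthetic regression task and reports the final loss. The task, model size,
# and optimizer choices are illustrative assumptions, not the paper's setup.
import torch
import torch.nn as nn

def make_model():
    return nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 1))

def train(opt_name, epochs=20):
    torch.manual_seed(0)
    x = torch.randn(1024, 32)
    y = x.sum(dim=1, keepdim=True)       # simple synthetic target
    model = make_model()                  # ~4.4k params; widen layers for ~10k
    if opt_name == "adam":
        opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    else:
        opt = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
    return loss.item()

for name in ("adam", "sgd"):
    print(name, train(name))
```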
spenrose 5 months ago
Something we need is no more papers titled "... All You Need".
rob_c 5 months ago
Interesting take, but:

After a reread it's nice to see the optimizer is faster, but how long is actually spent in the optimizer, and can AdamW be tuned for a low-memory environment, given that it is greedy in order to reduce the impact of statistical noise on gradient calculations?

Note that when training on ImageNet-1k it only becomes comparable to AdamW after many epochs and in fact performs measurably worse for most of the training session. (How significant that is, is up for debate and depends on model/task/data.)

Why not incorporate 2nd-order changes into AdamW directly?

The lower memory footprint is nice, but it's not immediately clear why this is the case. Is the batch size reduced? Is the model changed? I'll reread this after a 2nd coffee and see if it is more obvious...

Still promising if true.
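On the memory question, a back-of-the-envelope sketch of where the difference usually comes from: AdamW keeps two moment buffers (exp_avg and exp_avg_sq) per parameter, momentum SGD keeps one, and plain SGD keeps none. The 1B-parameter model size and fp32 optimizer state below are illustrative assumptions:

```python
# Rough optimizer-state memory estimate. Buffer counts reflect standard
# optimizer state (AdamW: two moments per param, momentum SGD: one, plain
# SGD: none); the model size and fp32 state are illustrative assumptions.
def optimizer_state_gib(num_params, buffers_per_param, bytes_per_value=4):
    return num_params * buffers_per_param * bytes_per_value / 2**30

n = 1_000_000_000  # hypothetical 1B-parameter model
for name, buffers in (("AdamW", 2), ("SGD + momentum", 1), ("plain SGD", 0)):
    print(f"{name:>15}: {optimizer_state_gib(n, buffers):.1f} GiB of optimizer state")
```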
amunozo 5 months ago
It's time to stop the "All You Need" titles. This one does not even sound good.