No More Adam: Learning Rate Scaling at Initialization Is All You Need

91 points | by jinqueeny | 5 months ago

5 comments

akos23 | 5 months ago
I don't find this very convincing, from both a mathematical and an experimental standpoint.

It seems their method is equivalent to SGD where the learning rate of each tensor is scaled by the number of elements in the tensor. The supposed "Signal-to-Noise ratio" they use is just gSNR = norm(g)/RMS(g - mean(g)), where g is the gradient w.r.t. a d-dimensional tensor and the mean is computed across the elements of g. For a zero-mean iid random gradient, the elementwise mean(g) ≈ 0. A similar argument probably holds for arbitrary, but not completely random, high-dimensional gradients: mean(g) ≈ 0. In this case gSNR ≈ sqrt(d), which explains why it is constant over time and how it varies across the components of the network.

It also seems the optimal value in their hyperparameter sweeps lands at the edge of the sweep in almost every case, and a granularity of 10x for the learning rate and weight decay is too coarse to make direct comparisons anyway.
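A quick numerical check of this gSNR argument (my own sketch, not code from the paper): for an iid gradient with roughly zero mean, norm(g)/RMS(g - mean(g)) comes out close to sqrt(d) regardless of the gradient's overall scale.

    import numpy as np

    # Sanity check of the commenter's claim (a sketch, not the paper's code):
    # for an iid gradient g with d elements and mean(g) ~ 0,
    # gSNR = norm(g) / RMS(g - mean(g)) ~ sqrt(d), independent of gradient scale.
    for d in [100, 10_000, 1_000_000]:
        g = np.random.randn(d) * 1e-3   # arbitrary scale
        gsnr = np.linalg.norm(g) / np.sqrt(np.mean((g - g.mean()) ** 2))
        print(f"d={d:>9}  gSNR={gsnr:10.1f}  sqrt(d)={d**0.5:10.1f}")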
eden-u4 | 5 months ago
Tried the source code on a toy model: Adam took 2 epochs to train a 10k-parameter model, while this one didn't achieve anything useful in 20.

Tweaked the hyperparameters a bit and such, but nothing. Probably a bogus implementation?
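For reference, the kind of quick toy-model check being described might look like the following (a hypothetical sketch using Adam as the baseline; the model, data, and hyperparameters are my own placeholders, not the paper's code or setup):

    import torch
    import torch.nn as nn

    # Hypothetical toy benchmark in the spirit of the comment (not the paper's code):
    # a small MLP (~12.5k parameters) on synthetic data, trained with Adam.
    torch.manual_seed(0)
    X = torch.randn(2048, 32)
    y = (X.sum(dim=1, keepdim=True) > 0).float()

    model = nn.Sequential(nn.Linear(32, 128), nn.ReLU(),
                          nn.Linear(128, 64), nn.ReLU(),
                          nn.Linear(64, 1))
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.BCEWithLogitsLoss()

    for epoch in range(20):
        opt.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        opt.step()
        print(f"epoch {epoch:2d}  loss {loss.item():.4f}")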
spenrose | 5 months ago
Something we need is no more papers titled "... All You Need".
rob_c | 5 months ago
Interesting take, but:

After a reread, it's nice to see that the optimizer is faster, but how much time is actually spent in the optimizer, and can AdamW be tuned for a low-memory environment, given that it is greedy in trying to reduce the impact of statistical noise on gradient calculations?

Note that when training on ImageNet-1k it only becomes comparable to AdamW after many epochs, and in fact performs measurably worse for most of the training session. (How significant that is is up for debate and depends on the model/task/data.)

Why not incorporate second-order changes into AdamW directly?

The lower memory footprint is nice, but it's not immediately clear why this is the case. Is the batch size reduced? The model changed? I'll reread this after a second coffee and see if it's more obvious...

Still promising if true.
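On the memory question: a standard explanation, independent of this paper's specifics, is optimizer state. Adam/AdamW keeps two extra float buffers per parameter (first and second moments), so dropping that state saves roughly twice the model's parameter memory. A quick PyTorch check of that baseline cost (my sketch, not the paper's measurement):

    import torch
    import torch.nn as nn

    # AdamW keeps two moment buffers (exp_avg, exp_avg_sq) per parameter, so an
    # optimizer without that state saves roughly 2x the parameter memory.
    # This only illustrates the baseline cost, not the paper's reported numbers.
    model = nn.Linear(1024, 1024)
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

    model(torch.randn(8, 1024)).sum().backward()
    opt.step()                        # one step populates the optimizer state

    n_params = sum(p.numel() for p in model.parameters())
    n_state = sum(s["exp_avg"].numel() + s["exp_avg_sq"].numel()
                  for s in opt.state.values())
    print(f"parameters: {n_params:,}   AdamW moment buffers: {n_state:,}")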
amunozo | 5 months ago
It's time to stop the "All You Need" titles. This one does not even sound good.