
Optimizing ML training with metagradient descent

83 points by ladberg, about 2 months ago

3 comments

munchler, about 2 months ago
In the ML problem I'm working on now, there are about a dozen simple hyperparameters, and each training run takes hours or even days. I don't think there's any good way to search the space of hyperparameters without a deep understanding of the problem domain, and even then I'm often surprised when a minor config tweak yields better results (or fails to). Many of these hyperparameters affect performance directly and are very sensitive to hardware limits, so a bad value leads to an out-of-memory error in one direction or a runtime measured in years in the other. It's a real-world halting problem on steroids.

This is not to even mention more complex design decisions, like the architecture of the model, which can't be captured in a simple hyperparameter.
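One rough mitigation for the two failure modes described above, sketched here as an illustration rather than anything from the thread: probe each candidate configuration with a handful of steps and project the full-run cost before committing hours or days to it. The `training_step` below is a dummy stand-in for a real training step, and memory checks (the other direction of the trap) would be framework-specific, so they are left out.

```python
# Sketch (assumption, not from the thread): time a short probe run and
# reject configurations whose projected wall-clock cost is absurd, so a
# bad hyperparameter value fails in seconds rather than days.
import time

def training_step(config):
    """Dummy stand-in for one real optimizer step; cost scales with batch size."""
    time.sleep(0.001 * config["batch_size"])

def projected_runtime_hours(config, total_steps, probe_steps=10):
    """Estimate the full-run wall-clock time from a short timed probe."""
    start = time.time()
    for _ in range(probe_steps):
        training_step(config)
    per_step = (time.time() - start) / probe_steps
    return per_step * total_steps / 3600.0

config = {"batch_size": 64, "learning_rate": 3e-4}
hours = projected_runtime_hours(config, total_steps=200_000)
print(f"projected runtime: {hours:.1f} h")  # skip this config if the estimate is unreasonable
```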
jn2clark, about 2 months ago
How does it compare to previous work on learning to learn? I don't see it referenced: https://arxiv.org/abs/1606.04474
MITSardine, about 2 months ago
Great article, very clear. I was particularly curious because I've observed that functions of the form

f : x -> argmin_y g(x, y)

are generally not smooth, often not even continuous, even though, as the article points out, every step of the program that approximates the argmin is differentiable.

There are several issues with this kind of function. First, it's not even necessarily well defined: there could be two or more argmin values for a given x. Your algorithm will find just one of them, but the underlying mathematical object is still ill-defined.

A second issue is that optimization algorithms don't converge to the argmin to begin with, but to local minima, whose values can differ dramatically and unpredictably. "Converge" is also an important word, as they stop before reaching the exact value; if g has minima that vary by less than your floating-point representation can resolve (to take an extreme case), the problem is effectively numerically ill-posed even when it is mathematically well-posed.

Lastly, even when everything is well behaved (a single global optimum, an ideal optimizer that finds it exactly, g infinitely smooth), consider:

- g(x, y) = sin(x)sin(y) defined on [-pi/2, pi/2]^2

- for x in ]0, pi/2], sin(x) > 0, thus sin(x)sin(y) is minimized at y = -pi/2

- for x in [-pi/2, 0[, sin(x) < 0, thus sin(x)sin(y) is minimized at y = pi/2

It follows that f is a step function, f(x) = -(pi/2) sign(x), which is not even continuous. In conclusion, even when everything is very simple and perfectly behaved, any approach that relies on differentiating the result of a minimization procedure is standing on very fragile scaffolding.

For these reasons, I find the article stops at the most important part: it assesses the smoothness of such a function f empirically, but it does not explain how you might go about regularizing one found in the wild. This is the most crucial step in my opinion, because otherwise we're still somewhat stuck tinkering effectively blindly in most cases. I don't mean any disrespect to what's otherwise very interesting and clearly thorough work, but since this is posted on a "layman forum", I just meant to point out what I think is an important limitation to how far-reaching the work here is.

---

Typo in Definition 2: 2hv? Or is this related to h(\Delta_f(z+h;h) - \Delta_f(z;h))? (Unsure if it's the authors posting this.)

I personally also find the \Delta_f notation confusing; I think a more elegant notation, coherent with the paper's other notations, would have been

\frac{\delta f}{\delta v}(z;h)

Note that df/dx is a directional derivative with respect to the basis vector x, and it's no trouble putting any v there. Lowercase delta is often used for a discrete difference, so I think this fits the finite-difference gradient well. As for the (z;h), this mirrors how the differential is often written: at z and applied to h. (The h is lost in the article's \Delta_f notation.)

Question on Definition 2: why take the sign of the vectors and then compute a weighted scalar product of that, rather than just computing the scalar product of the raw vectors and normalizing by their norms? That would be in [-1, 1] as well, so I'm curious why this particular measure was chosen.
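A quick numerical check of the example above (my own sketch, not taken from the paper or the thread): approximating f(x) = argmin_y sin(x)sin(y) by a grid search over y shows the jump from +pi/2 to -pi/2 as x crosses 0.

```python
# Minimal sketch: f(x) = argmin_y sin(x) * sin(y) over y in [-pi/2, pi/2]
# is a step function, even though g(x, y) = sin(x) * sin(y) is smooth.
import numpy as np

ys = np.linspace(-np.pi / 2, np.pi / 2, 2001)  # dense grid over y

def inner_argmin(x: float) -> float:
    """Approximate f(x) = argmin_y sin(x) * sin(y) by grid search over y."""
    return float(ys[np.argmin(np.sin(x) * np.sin(ys))])

for x in (-0.1, -0.01, -0.001, 0.001, 0.01, 0.1):
    print(f"x = {x:+.3f}  ->  f(x) ~ {inner_argmin(x):+.4f}")
# Prints ~ +1.5708 for every x < 0 and ~ -1.5708 for every x > 0: a
# Heaviside-like jump at x = 0, where the argmin is not even unique.
```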