
Optimizing ML training with metagradient descent

83 points by ladberg, about 2 months ago

3 comments

munchler, about 2 months ago
In the ML problem I'm working on now, there are about a dozen simple hyperparameters, and each training run takes hours or even days. I don't think there's any good way to search the space of hyperparameters without a deep understanding of the problem domain, and even then I'm often surprised when a minor config tweak yields better results (or fails to). Many of these hyperparameters affect performance directly and are very sensitive to hardware limits, so a bad value leads to an out-of-memory error in one direction or a runtime measured in years in the other. It's a real-world halting problem on steroids.

This is not to even mention more complex design decisions, like the architecture of the model, which can't be captured in a simple hyperparameter.
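One rough mitigation for the two failure modes described above, sketched here as an illustration rather than anything from the thread: probe each candidate configuration with a handful of steps and project the full-run cost before committing hours or days to it. The `training_step` below is a dummy stand-in for a real training step, and memory checks (the other direction of the trap) would be framework-specific, so they are left out.

```python
# Sketch (assumption, not from the thread): time a short probe run and
# reject configurations whose projected wall-clock cost is absurd, so a
# bad hyperparameter value fails in seconds rather than days.
import time

def training_step(config):
    """Dummy stand-in for one real optimizer step; cost scales with batch size."""
    time.sleep(0.001 * config["batch_size"])

def projected_runtime_hours(config, total_steps, probe_steps=10):
    """Estimate the full-run wall-clock time from a short timed probe."""
    start = time.time()
    for _ in range(probe_steps):
        training_step(config)
    per_step = (time.time() - start) / probe_steps
    return per_step * total_steps / 3600.0

config = {"batch_size": 64, "learning_rate": 3e-4}
hours = projected_runtime_hours(config, total_steps=200_000)
print(f"projected runtime: {hours:.1f} h")  # skip this config if the estimate is unreasonable
```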
jn2clark, about 2 months ago
How does it compare to previous work on learning to learn? I don't see it referenced: https://arxiv.org/abs/1606.04474
MITSardine, about 2 months ago
Great article, very clear. I was particularly curious because I've observed that functions of the form

f : x -> argmin_y g(x, y)

are generally not smooth, often not even continuous, even though, as the article points out, every step of the program that approximates the argmin is differentiable.

There are several issues with this kind of function. First, it's not even necessarily well defined: there could be two or more argmin values for a given x. Your algorithm will find just one of them, but the underlying mathematical object is still ill-defined.

A second issue is that optimization algorithms don't converge to the argmin to begin with, but to local minima, whose values can differ dramatically and unpredictably. "Converge" is also an important word, as they stop before reaching the exact value; if g has minima that vary by less than your floating-point representation can resolve (to take an extreme case), the problem is effectively numerically ill-posed even when it is mathematically well-posed.

Lastly, even when everything is well behaved (a single global optimum, an ideal optimizer that finds it exactly, g infinitely smooth), consider:

- g(x, y) = sin(x)sin(y) defined on [-pi/2, pi/2]^2

- for x in ]0, pi/2], sin(x) > 0, thus sin(x)sin(y) is minimized at y = -pi/2

- for x in [-pi/2, 0[, sin(x) < 0, thus sin(x)sin(y) is minimized at y = pi/2

It follows that f is a step function, f(x) = -(pi/2) sign(x), which is not even continuous. In conclusion, even when everything is very simple and perfectly behaved, any approach that relies on differentiating the result of a minimization procedure is standing on very fragile scaffolding.

For these reasons, I find the article stops at the most important part: it assesses the smoothness of such a function f empirically, but it does not explain how you might go about regularizing one found in the wild. This is the most crucial step in my opinion, because otherwise we're still somewhat stuck tinkering effectively blindly in most cases. I don't mean any disrespect to what's otherwise very interesting and clearly thorough work, but since this is posted on a "layman forum", I just meant to point out what I think is an important limitation to how far-reaching the work here is.

---

Typo in Definition 2: 2hv? Or is this related to h(\Delta_f(z+h;h) - \Delta_f(z;h))? (Unsure if it's the authors posting this.)

I personally also find the \Delta_f notation confusing; I think a more elegant notation, coherent with the paper's other notations, would have been

\frac{\delta f}{\delta v}(z;h)

Note that df/dx is a directional derivative with respect to the basis vector x, and it's no trouble putting any v there. Lowercase delta is often used for a discrete difference, so I think this fits the finite-difference gradient well. As for the (z;h), this mirrors how the differential is often written: at z and applied to h. (The h is lost in the article's \Delta_f notation.)

Question on Definition 2: why take the sign of the vectors and then compute a weighted scalar product of that, rather than just computing the scalar product of the raw vectors and normalizing by their norms? That would be in [-1, 1] as well, so I'm curious why this particular measure was chosen.
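A quick numerical check of the example above (my own sketch, not taken from the paper or the thread): approximating f(x) = argmin_y sin(x)sin(y) by a grid search over y shows the jump from +pi/2 to -pi/2 as x crosses 0.

```python
# Minimal sketch: f(x) = argmin_y sin(x) * sin(y) over y in [-pi/2, pi/2]
# is a step function, even though g(x, y) = sin(x) * sin(y) is smooth.
import numpy as np

ys = np.linspace(-np.pi / 2, np.pi / 2, 2001)  # dense grid over y

def inner_argmin(x: float) -> float:
    """Approximate f(x) = argmin_y sin(x) * sin(y) by grid search over y."""
    return float(ys[np.argmin(np.sin(x) * np.sin(ys))])

for x in (-0.1, -0.01, -0.001, 0.001, 0.01, 0.1):
    print(f"x = {x:+.3f}  ->  f(x) ~ {inner_argmin(x):+.4f}")
# Prints ~ +1.5708 for every x < 0 and ~ -1.5708 for every x > 0: a
# Heaviside-like jump at x = 0, where the argmin is not even unique.
```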