> Grimmer found that the fastest sequences always had one thing in common: The middle step was always a big one. Its size depended on the number of steps in the repeating sequence. For a three-step sequence, the big step had length 4.9. For a 15-step sequence, the algorithm recommended one step of length 29.7. And for a 127-step sequence, the longest one tested, the big central leap was a whopping 370. At first that sounds like an absurdly large number, Grimmer said, but there were enough total steps to make up for that giant leap, so even if you blew past the bottom, you could still make it back quickly. His paper showed that this sequence can arrive at the optimal point nearly three times faster than it would by taking constant baby steps. “Sometimes, you should really overcommit,” he said.<p>I wonder if this is a Rube Goldberg machine for imitating a second-order method. Gradient descent is slow because it can't correct for the curvature of the function, but maybe when you use a meta-algorithm to optimize the entire sequence of steps, you can arrive at something that approximates it. So the first half of the steps is getting you in position for a big approximate second-order step, and the second half is working to clean up the after-effects of that big step.
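This is easy to poke at on a toy problem. A minimal sketch in Python (only the 4.9 middle step comes from the article; the flanking step lengths and the test function are my own guesses, not Grimmer's actual schedule):<p><pre><code>  import math

  def grad_descent(x, steps):
      # f(x) = sqrt(1 + x^2) - 1 is convex and 1-smooth,
      # with gradient f'(x) = x / sqrt(1 + x^2).
      for h in steps:
          x -= h * x / math.sqrt(1 + x * x)
      return x

  x0 = 5.0
  print(grad_descent(x0, [1.5, 1.5, 1.5]))  # ~0.73: constant baby steps, still far off
  print(grad_descent(x0, [1.5, 4.9, 1.5]))  # ~-0.04: overshoots zero, then snaps back
</code></pre>
Same budget of three steps in both runs; the big-step schedule blows past the minimum at x = 0 and recovers, the overcommit-then-correct pattern described above.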
Critical bit:<p>> However, while these insights may change how researchers think about gradient descent, they likely won’t change how the technique is currently used. Grimmer’s paper focused only on smooth functions, which have no sharp kinks, and convex functions, which are shaped like a bowl and only have one optimal value at the bottom. These kinds of functions are fundamental to theory but less relevant in practice; the optimization programs machine learning researchers use are usually much more complicated...<p>And machine learning is by <i>no</i> means the only area in which the "real world" curves subject to optimization have, ah, "problematic" shapes.
Forgive me if this sounds naive, but it sort of reminds me of life in a way. I equate these little steps to the way most of our industries and institutions focus on producing incremental and predictable progress. In such an environment, much of the power gives way to manager and administrator type participants. Think of academia, or how most of tech has become (i.e., SaaS).<p>On the other hand, while that is certainly useful, it pales in comparison to combining it with a non-linear approach, where you’re not trying to optimize a known process but fundamentally rethink the question altogether. To me that’s reminiscent of physics in the early twentieth century, democracy and liberal economies during the Enlightenment, and even modern literature, in a sense, like Joyce.<p>I don’t really have a point beyond that, and it’s just an analogy after all. It’s still neat to think about and recognize similarities between a discovery in the computational world and a personal observation about our own world. Even if it’s a misguided analogy, it feels true.
Their mathematical analysis is for sure beyond my understanding, but on some level, I'm surprised this is surprising. While baby steps may be the norm in gradient descent, there are plenty of other optimization techniques that embrace larger step sizes. Some, like simulated annealing, explicitly model changes in step size over time. Others, like genetic algorithms with crossover, let the balance between exploration (big steps) and exploitation (small steps) emerge more organically.<p>In fact, applying the idea of "no free lunch in search and optimization" (<a href="https://en.wikipedia.org/wiki/No_free_lunch_in_search_and_optimization" rel="nofollow noreferrer">https://en.wikipedia.org/wiki/No_free_lunch_in_search_and_op...</a>), you'd kind of expect this result: sometimes big steps are good, sometimes big steps are bad.
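For instance, here is a minimal simulated-annealing sketch (the temperature schedule and constants are arbitrary choices) where the proposal step size explicitly shrinks over time, big exploratory jumps first and small exploitative ones later:<p><pre><code>  import math, random

  def anneal(f, x, t0=2.0, cooling=0.95, iters=200):
      t = t0
      for _ in range(iters):
          cand = x + random.gauss(0.0, t)  # proposal width tracks the temperature
          delta = f(cand) - f(x)
          # Metropolis criterion: always accept downhill moves,
          # accept uphill moves with probability exp(-delta / t).
          if delta < 0 or random.random() < math.exp(-delta / t):
              x = cand
          t *= cooling
      return x

  # A bumpy landscape where pure baby steps would get stuck in a local minimum.
  f = lambda x: 0.1 * x * x + math.sin(3 * x)
  print(anneal(f, 5.0))  # usually lands near the global minimum around x = -0.52
</code></pre>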
Hi! I've worked a lot with ML algorithms, and I believe this is tied in with regret. Not only do huge steps (after an initial warmup) get to a rough minimum faster; due to the energy involved, the minimum found also tends to be shallower, because the parameters jitter around so much during training.<p>It sort of makes sense too as you're approaching a finite goal in an optimization problem: first, tiny warmup steps to absorb initial gradients that can be huge due to initialization differences; then annealing up to huge steps; then small steps again as you approach the end of your optimization process, where your signal-to-noise ratio is much lower.
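In code, that warmup-then-anneal shape looks something like this (a hypothetical sketch; the peak value and warmup fraction are arbitrary, and real schedules vary a lot):<p><pre><code>  import math

  def lr(step, total, peak=3.0, warmup_frac=0.1):
      # Linear warmup from ~0 to a huge peak, then cosine-anneal
      # back toward 0: tiny steps, huge steps, tiny steps again.
      warmup = max(1, int(total * warmup_frac))
      if step < warmup:
          return peak * (step + 1) / warmup
      progress = (step - warmup) / (total - warmup)
      return peak * 0.5 * (1.0 + math.cos(math.pi * progress))
</code></pre>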
The overshooting-and-correcting pattern (and the 'fractal' therein) can be understood analogously by considering a binary search tree. The tree is fractal (its subtrees have the same structure), and each step may overshoot until the solution is found.<p>This analogy can be made rigorous by looking at bisection search on a convex function instead of a simple binary search tree. The same pattern holds. "Fractal" patterns occur because the problem is invariant under scaling.<p>What is interesting is the initial half of small steps. I have not read the original paper closely enough to give a professional view on the mathematics, but it may just be that the initial small steps are priming the optimizer to find the right scale.<p>Unfortunately, it may also be that the findings are not something fundamental to gradient descent or optimization in general, but just an observed property of the smooth convex cost functions considered under this particular algorithmic investigation.
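For concreteness, the bisection version of that analogy (a sketch: for a convex f, the derivative df crosses zero exactly once, and the bracket halves identically at every scale):<p><pre><code>  def bisect_min(df, lo, hi, iters=50):
      # Each probe may "overshoot" the minimizer; the bracket then
      # halves, repeating the same pattern at every scale.
      for _ in range(iters):
          mid = 0.5 * (lo + hi)
          if df(mid) > 0:
              hi = mid  # minimizer lies to the left
          else:
              lo = mid  # minimizer lies to the right
      return 0.5 * (lo + hi)

  # Example: f(x) = (x - 1.3)^2, so df(x) = 2 * (x - 1.3).
  print(bisect_min(lambda x: 2 * (x - 1.3), -10.0, 10.0))  # ~1.3
</code></pre>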
To me this is a bit exemplary of how science sometimes gets functional myopia. Conceptually it's obvious that playing with step sizes gives you a higher chance of getting closer to the global (not just a local) optimum initially, because you can scout; but you want to get your algorithm working, so let's slap on a heuristic and first converge to local optima with learning-rate decay.
Then most people in the scientific community seem to forget (for example) about this heuristic, because it's Always Been There, and direct all their focus on improving the status quo, forgetting that the assumption we put there is itself still a construction site. Lumpers and splitters, I guess. This sounds bitter; I'm actually happy Mr. Grimmer took this on!
Related, and known (at least empirically) for a while, Lévy flight sampling methods:<p><a href="https://en.wikipedia.org/wiki/L%C3%A9vy_flight" rel="nofollow noreferrer">https://en.wikipedia.org/wiki/L%C3%A9vy_flight</a>
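For anyone unfamiliar, the signature feature is the heavy-tailed step-length distribution, e.g. (alpha and the walk below are an arbitrary illustration, not a tuned sampler):<p><pre><code>  import random

  def levy_walk_1d(x0, steps=1000, alpha=1.5):
      # Pareto-distributed step lengths: mostly small moves,
      # occasionally enormous jumps, i.e. local search mixed
      # with long-range exploration.
      x = x0
      for _ in range(steps):
          length = random.paretovariate(alpha)
          x += length if random.random() < 0.5 else -length
      return x

  print(levy_walk_1d(0.0))
</code></pre>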
Leslie Smith has a number of great papers; this one is particularly relevant. I believe he coined the term “super-convergence” in the space of training neural networks. Same idea: using very large learning rates post-warmup can speed up training by an order of magnitude. <a href="https://arxiv.org/abs/1708.07120" rel="nofollow noreferrer">https://arxiv.org/abs/1708.07120</a>
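Roughly, the “1cycle” policy from that paper has this shape (a simplified sketch; the values are arbitrary, and the paper also cycles momentum inversely and adds a final annihilation phase, both omitted here):<p><pre><code>  def one_cycle_lr(step, total, base_lr=0.1, max_lr=1.0):
      # Ramp linearly up to a very large learning rate over the
      # first half of training, then back down over the second half.
      half = total / 2
      if step < half:
          return base_lr + (max_lr - base_lr) * (step / half)
      return max_lr - (max_lr - base_lr) * ((step - half) / half)
</code></pre>
(PyTorch ships a ready-made version of this as torch.optim.lr_scheduler.OneCycleLR.)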
> Grimmer found that the fastest sequences always had one thing in common: The middle step was always a big one.<p>So... it's like binary search, but you don't know the maximum beforehand? (i.e., you double the step size until you overshoot, then do regular binary search from there)
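Right, that's essentially "galloping" (exponential) search. A sketch, assuming pred is monotone and false at the starting point:<p><pre><code>  def galloping_search(pred, lo=0):
      # Double the step until pred flips to True (the "overshoot"),
      # then binary-search inside the final bracket.
      step = 1
      while not pred(lo + step):
          step *= 2
      lo, hi = lo + step // 2, lo + step
      while lo < hi:
          mid = (lo + hi) // 2
          if pred(mid):
              hi = mid
          else:
              lo = mid + 1
      return lo

  # First i with i*i >= 1000, with no upper bound known in advance.
  print(galloping_search(lambda i: i * i >= 1000))  # 32
</code></pre>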
I glanced at the paper. Does it focus on semidefinite programming, where the objective function is linear, subject to linear matrix inequality constraints?