The results of the experiment seem counterintuitive mainly because the learning rates used are huge (up to 10 or even 100). These are not learning rates you would use in a normal setting. If you look at the region of small learning rates, it seems they all converge.<p>So I would say the experiment is interesting, but not representative of real-world deep learning.<p>In the experiment, you have a function of 272 variables with a lot of minima and maxima, and at each gradient descent step you take huge steps (due to the big learning rates). So my intuition is that convergence is more a matter of luck than of hyperparameters.
Twitter: <a href="https://twitter.com/jaschasd/status/1756930242965606582" rel="nofollow">https://twitter.com/jaschasd/status/1756930242965606582</a>
ArXiv: <a href="https://arxiv.org/abs/2402.06184" rel="nofollow">https://arxiv.org/abs/2402.06184</a><p>Abstract:<p>"Some fractals -- for instance those associated with the Mandelbrot and quadratic Julia sets -- are computed by iterating a function, and identifying the boundary between hyperparameters for which the resulting series diverges or remains bounded. Neural network training similarly involves iterating an update function (e.g. repeated steps of gradient descent), can result in convergent or divergent behavior, and can be extremely sensitive to small changes in hyperparameters. Motivated by these similarities, we experimentally examine the boundary between neural network hyperparameters that lead to stable and divergent training. We find that this boundary is fractal over more than ten decades of scale in all tested configurations."<p>Contains several cool animations zooming in to show the fractal boundary between convergent and divergent training, just like the classic Mandelbrot and Julia set animations.
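For anyone unfamiliar with how those classic escape-time fractals are computed, here's a rough numpy sketch of the Mandelbrot iteration the abstract alludes to (my own illustrative code, not from the paper; the grid resolution, bounds, and iteration count are arbitrary choices): iterate z -> z^2 + c for every value of the "hyperparameter" c and mark which values stay bounded.

    import numpy as np

    # Escape-time computation for the Mandelbrot set: for each "hyperparameter" c,
    # iterate z -> z^2 + c and record the step at which |z| exceeds 2 (if ever).
    # The fractal is the boundary between the bounded and divergent regions.
    re = np.linspace(-2.0, 0.6, 400)
    im = np.linspace(-1.3, 1.3, 400)
    c = re[None, :] + 1j * im[:, None]

    z = np.zeros_like(c)
    escape = np.full(c.shape, np.inf)      # iteration at which each point diverged

    for i in range(100):
        active = np.isinf(escape)          # only iterate points that haven't escaped yet
        z[active] = z[active] ** 2 + c[active]
        escape[active & (np.abs(z) > 2)] = i

    print("fraction of grid that stays bounded:", np.isinf(escape).mean())

The paper's point is that replacing this iteration with repeated gradient-descent steps, and c with the training hyperparameters, produces a convergent/divergent boundary that is also fractal.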
I find this result absolutely fascinating, and it is exactly the type of research into neural networks we should be expanding.<p>We've rapidly engineered our way to some <i>very</i> impressive models this past decade, and yet the gap in our real understanding of what's going on has widened. There's a long list of very basic questions about LLMs that we haven't answered (or in some cases, really asked). This is not a failing of the people researching in this area; it's only that things move so quickly there's not enough time to ponder questions like this.<p>At the same time, the result, unless I'm really misunderstanding, gives me the impression that anything other than grid-search hyperparameter optimization is a fool's errand. This would give credence to the notion that hyperparameter tuning really is akin to just re-rolling a character sheet until you get one that is overpowered.
If you are a fan of the fractals but feel intimidated by neural networks, the networks used here are actually pretty simple and not so difficult to understand if you are familiar with matrix multiplication. To generate a dataset, he samples random vectors (say of size 8) as inputs, and for each vector a target output, which is a single number. The network consists of an 8x8 matrix and an 8x1 matrix, also randomly initialized.<p>To generate an output from an input vector, you just multiply by your 8x8 matrix (getting a new size 8 vector), apply the tanh function to each element (look up a plot of tanh - it just squeezes its inputs to be between -1 and 1), and then multiply by the 8x1 matrix, getting a single value as an output. The elements of the two matrices are the 'weights' of the neural network, and they are updated to push the output we got towards the target.<p>When we update our weights, we have to decide on a step size - do we make just a little tiny nudge in the right direction, or take a giant step? The plots are showing what happens if we choose different step sizes for the two matrices ("input layer learning rate" is how big of a step we take for the 8x8 matrix, and "output layer learning rate" for the 8x1 matrix).<p>If your steps are too big, you run into a problem. Imagine trying to find the bottom of a parabola by taking steps in the direction of downward slope - if you take a giant step, you'll pass right over the bottom and land on the opposite slope, maybe even higher than you started! This is the red region of the plots. If you take really really tiny steps, you'll be safe, but it'll take you a long time to reach the bottom. This is the dark blue section. Another way you can take a long time is to take big steps that jump from one slope to the other, but just barely small enough to end up a little lower each time (this is why there's a dark blue stripe near the boundary). The light green region is where you take goldilocks steps - big enough to find the bottom quickly, but small enough to not jump over it.
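To make the above concrete, here's a rough numpy sketch of that kind of two-layer tanh network with a separate learning rate per weight matrix (my own illustrative code, not the author's; the dataset size, number of steps, and learning-rate values are made up):

    import numpy as np

    # Two-layer tanh network trained by full-batch gradient descent on mean squared
    # error, with a separate learning rate for each weight matrix (the two plot axes).
    rng = np.random.default_rng(0)

    n, d = 16, 8                       # 16 training examples, input dimension 8
    X = rng.standard_normal((n, d))    # random input vectors
    y = rng.standard_normal(n)         # random scalar targets

    W1 = rng.standard_normal((d, d))   # the 8x8 "input layer" matrix
    w2 = rng.standard_normal(d)        # the 8x1 "output layer" matrix

    lr1, lr2 = 0.5, 0.5                # input-layer and output-layer learning rates

    for step in range(100):
        h = np.tanh(X @ W1)            # hidden activations, squeezed into (-1, 1)
        pred = h @ w2                  # one scalar output per example
        err = pred - y
        loss = np.mean(err ** 2)

        # Gradients of the mean squared error with respect to each weight matrix.
        grad_w2 = 2 * h.T @ err / n
        grad_h = np.outer(err, w2) * (1 - h ** 2)   # backprop through tanh
        grad_W1 = 2 * X.T @ grad_h / n

        W1 -= lr1 * grad_W1            # each layer steps with its own learning rate
        w2 -= lr2 * grad_w2

    print("final loss:", loss)         # settles down or blows up depending on (lr1, lr2)

Sweeping (lr1, lr2) over a grid and coloring each point by whether the final loss converged or diverged is what produces the fractal plots.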
Here's the associated blog post, which includes the videos: <a href="https://sohl-dickstein.github.io/2024/02/12/fractal.html" rel="nofollow">https://sohl-dickstein.github.io/2024/02/12/fractal.html</a><p>Not an ML'er, so I'm not sure what to make of it beyond a fascinating connection.
This is really fun, and beautiful. Also, despite what people are saying about the learning rates being unrealistic, the findings fit well with my own experience using optimisation algorithms in the real world. If our code ever produced significantly different results on different processor architectures (e.g. a machine taking an AVX code path vs an SSE one), you could be sure the divergence began during execution of an optimisation algorithm. The chaotic sensitivity to initial conditions really showed up there, just as it did in the author's Newton solver plot. Although I knew at some level that this behaviour was chaotic, it never would have occurred to me to ask whether it made a pretty fractal!
I appreciate that his acknowledgements here were to his daughter ("for detailed feedback on the generated fractals") and wife ("for providing feedback on a draft of this post")
This is kind of random, but I wonder: if you had a sufficiently complex lens, or series of lenses, perhaps with specific areas darkened, could you make a lens that lets light through if presented with, say, a cat, but not with anything else? Bending light and darkening it selectively could probably reproduce a layer of a neural net. That would be cool. I suppose you would need some substance that responds to light in a <i>nonlinear</i> way.
This is really fun to see. I love toy experiments like this. I see that each plot always uses the same initialization of weights, which presumably makes it possible to have more smoothness between adjacent pixels. I'd also guess it uses the same random seed for training (shuffling data).<p>I'd be curious to know what the plots would look like with different randomness/shuffling for each pixel's dataset. I'd guess that for the high learning rates it would be too noisy, but you might see fractal behavior at more typical and practical learning rates. You could also do the same with the random initialization of each dataset. This would get at whether the chaotic boundary also exists in more practical use cases.
If you liked this, you may also enjoy: "Back Propagation is Sensitive to Initial Conditions" from the early 90's. The discussion section is fun.<p><a href="https://proceedings.neurips.cc/paper/1990/file/1543843a4723ed2ab08e18053ae6dc5b-Paper.pdf" rel="nofollow">https://proceedings.neurips.cc/paper/1990/file/1543843a4723e...</a>
I'm really curious what effect the common tricks for training have on the smoothness of this landscape: momentum, skip connections, batch/layer/etc normalization, even model size.<p>I imagine the fractal or chaos is still there, but maybe "smoother" and easier for metalearning to deal with?
This is pretty interesting. Can’t help but be reminded of all the times I’ve done acid. Having been deep in ‘fractal country’ a few times, I’ve always felt the psychedelic effect comes from my brain going haywire and messing up its pattern recognition. I wonder if it’s related to this.
Reminds me of an excellent 3blue1brown video about Newton’s method [1]. You can see similar fractal patterns emerge there too.<p>[1] <a href="https://www.youtube.com/watch?v=-RdOwhmqP5s" rel="nofollow">https://www.youtube.com/watch?v=-RdOwhmqP5s</a>
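For the curious, the fractal in that video comes from which root Newton's method converges to as a function of the starting point. A rough numpy sketch (my own illustrative code, not from the video or the blog post) for p(z) = z^3 - 1:

    import numpy as np

    # Newton fractal for p(z) = z^3 - 1: color each starting point by the root that
    # the iteration z -> z - p(z)/p'(z) converges to. The basin boundaries are fractal.
    x = np.linspace(-2, 2, 400)
    y = np.linspace(-2, 2, 400)
    z = x[None, :] + 1j * y[:, None]

    roots = np.array([1, -0.5 + 0.8660254j, -0.5 - 0.8660254j])   # cube roots of unity

    for _ in range(40):
        z = z - (z ** 3 - 1) / (3 * z ** 2)    # one Newton step, vectorized over the grid

    basin = np.argmin(np.abs(z[..., None] - roots), axis=-1)      # nearest root per pixel
    print("pixels per basin:", np.bincount(basin.ravel()))

Plotting `basin` as an image shows the same kind of intricately interleaved boundaries as in the video.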
I hope one day we'll have generative AI capable of producing stuff like this on demand:<p><a href="https://www.youtube.com/watch?v=8cgp2WNNKmQ" rel="nofollow">https://www.youtube.com/watch?v=8cgp2WNNKmQ</a>