I used simulated annealing in my previous work doing molecular dynamics. It was great for keeping your protein from getting trapped in a local minimum, which gradient descent is prone to doing. I got excellent results for very little work.

I asked my intern, who was knowledgeable in deep networks as well as molecular stuff, "it looks like ML training mainly does gradient descent, how can that work, don't you get stuck in local minima?" and they said "loss functions in ML are generally believed to be bowl-shaped" and I've been wondering how that could be true.

It's also interesting to read up on the real-world use of annealing for steel, and how much you can change steel's properties through heat treatment. Want it really strong? Quench it fast; that will lock it into an unstable structure that's still strong. Cool it slowly and it will find a more stable minimum and be more ductile.
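For anyone who hasn't seen it, here's a minimal sketch of the idea in Python. It's a toy 1-D landscape, not my MD setup; the function names, cooling schedule, and parameters are all illustrative choices. The key trick is the Metropolis acceptance rule: uphill moves are sometimes accepted, with a probability that shrinks as the temperature drops, which is what lets the search climb out of local minima.

```python
import math
import random

def simulated_annealing(energy, neighbor, x0, t0=1.0, cooling=0.995, steps=10000):
    """Minimize `energy` by annealing. Illustrative sketch, not a
    library API: `neighbor` proposes a random nearby candidate."""
    x, e = x0, energy(x0)
    best_x, best_e = x, e
    t = t0
    for _ in range(steps):
        cand = neighbor(x)
        de = energy(cand) - e
        # Always accept downhill moves; accept uphill moves with
        # Metropolis probability exp(-dE/T), which shrinks as T cools.
        if de <= 0 or random.random() < math.exp(-de / t):
            x, e = cand, e + de
            if e < best_e:
                best_x, best_e = x, e
        t *= cooling  # geometric cooling schedule
    return best_x, best_e

# Toy energy landscape with many local minima; gradient descent
# started at x=4 would get stuck, but annealing usually escapes.
f = lambda x: x * x + 10 * math.sin(3 * x)
step = lambda x: x + random.gauss(0, 0.5)
print(simulated_annealing(f, step, x0=4.0))
```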