The main point here is that natural gradient descent is a second-order method. The natural gradient update is:<p>∇̃L(θ) = F⁻¹∇L(θ)<p>which requires solving the linear system F∇̃L(θ) = ∇L(θ) rather than inverting F directly. For this, you can use the methods from the authors' previous paper, [Thermodynamic Linear Algebra](<a href="https://arxiv.org/abs/2308.05660" rel="nofollow">https://arxiv.org/abs/2308.05660</a>).<p>Since it's hard to implement a full neural network on a thermodynamic computer, the paper suggests running one alongside a normal GPU: the GPU computes F and ∇L(θ), but offloads the linear solve to the thermodynamic computer, which runs in parallel with the digital system (Figure 1).<p>It's worth noting that the "Runtime vs Accuracy" plot in Figure 3 uses a "timing model" for the TNGD algorithm, since the hardware needed to actually run it doesn't exist yet.
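To make the division of labor concrete, here's a rough numpy sketch of my own (not code from the paper) of a single natural-gradient step; the dense solve in the middle is the piece that would be handed off to the thermodynamic solver:<p><pre><code>import numpy as np

def natural_gradient_step(theta, grad, jac, lr=1e-2, damping=1e-3):
    """One NGD step: solve (F + damping*I) x = grad for the natural gradient.

    grad : loss gradient, shape (p,)
    jac  : per-sample output Jacobian, shape (n, p), so that
           F ~= jac.T @ jac / n  (empirical Fisher / Gauss-Newton).
    """
    n = jac.shape[0]
    F = jac.T @ jac / n + damping * np.eye(theta.size)
    # This dense solve is the bottleneck the paper proposes to offload
    # to the analog hardware; here it's just numpy on the CPU.
    nat_grad = np.linalg.solve(F, grad)
    return theta - lr * nat_grad

# Toy usage on a quadratic loss with random data
rng = np.random.default_rng(0)
theta = rng.normal(size=8)
jac = rng.normal(size=(32, 8))
grad = jac.T @ (jac @ theta) / 32
theta = natural_gradient_step(theta, grad, jac)
</code></pre>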
Cool and interesting. The authors propose a hybrid digital-analog training loop that takes into account the curvature of the loss landscape (i.e., it uses second-order derivatives), and they show with numerical simulations that, if their method were implemented in a hybrid digital-analog physical system, each iteration of the training loop would incur a computational cost that is linear in the number of parameters. I'm all for figuring out ways to let the Laws of Thermodynamics do the work of training AI models, if doing so lets us overcome the scaling limitations and challenges of existing digital hardware and training methods.
I know they mainly present results on deep learning / neural network training and optimization, but I wonder how easy it would be to use the same optimization framework for other classes of hard or large optimization problems. I was also curious about this when I first saw posts about Extropic (<a href="https://www.extropic.ai/" rel="nofollow">https://www.extropic.ai/</a>).<p>I tried looking for any public info on their website about APIs or a software stack, to see what's possible beyond NN stuff for modeling other optimization problems. It looks like that's not shared publicly yet.<p>There are certainly plenty of NP-hard and large combinatorial or analytical optimization problems out there that would be worth tackling with new technology. Personally, I care about problems in EDA and semiconductor design. Adiabatic quantum computing was one technology that promised to solve optimization problems (and quantum computing is still playing out, with only small-scale solutions at the moment). I'm hoping these new "thermodynamic computing" startups might also provide some cool technology for exploring these problems.
Leveraging thermodynamics to compute second-order updates more efficiently is certainly cool and worth exploring; however, specifically in the context of deep learning, I remain skeptical of its usefulness.<p>We already have very efficient second-order methods running on classical hardware [1], but they are basically not used at all in practice, because they are outperformed by Adam and other first-order methods. This is because optimizing highly nonlinear loss functions, such as the ones in deep learning models, only really works with very low learning rates, regardless of whether a first- or second-order method is used. So, comparatively speaking, a second-order method might give you a slightly better parameter update per step, but at a more-than-slightly-higher cost per step, so most of the time it's simply not worth doing.<p>[1] <a href="https://andrew.gibiansky.com/blog/machine-learning/hessian-free-optimization/" rel="nofollow">https://andrew.gibiansky.com/blog/machine-learning/hessian-f...</a>
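To put a number on the "more-than-slightly-higher cost": the core trick in [1] (Hessian-free optimization) is to never build the Hessian and instead run conjugate gradient using only Hessian-vector products. Here's a rough sketch of my own (the finite-difference HVP and the function names are just illustrative, not the article's code). Each CG iteration costs roughly two extra gradient evaluations, so even a modest 10-iteration inner solve makes one curvature-aware step an order of magnitude more expensive than a plain Adam/SGD step:<p><pre><code>import numpy as np

def hvp(grad_fn, theta, v, eps=1e-5):
    """Hessian-vector product via finite differences of the gradient:
    H v ~= (grad(theta + eps*v) - grad(theta - eps*v)) / (2*eps).
    Two gradient evaluations instead of forming the p x p Hessian."""
    return (grad_fn(theta + eps * v) - grad_fn(theta - eps * v)) / (2 * eps)

def cg_solve(grad_fn, theta, b, iters=10, damping=1e-2):
    """Conjugate gradient on (H + damping*I) x = b using only HVPs --
    the core idea behind Hessian-free optimization."""
    x = np.zeros_like(b)
    r = b.copy()
    p = r.copy()
    rs = r @ r
    for _ in range(iters):
        Hp = hvp(grad_fn, theta, p) + damping * p
        alpha = rs / (p @ Hp)
        x += alpha * p
        r -= alpha * Hp
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# Toy usage: quadratic loss L(theta) = 0.5 * theta^T A theta
rng = np.random.default_rng(0)
M = rng.normal(size=(8, 8))
A = M @ M.T + np.eye(8)            # positive definite Hessian
grad_fn = lambda th: A @ th        # gradient of the quadratic
theta0 = rng.normal(size=8)
step = cg_solve(grad_fn, theta0, grad_fn(theta0))  # approx H^{-1} grad
</code></pre>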
Not having read the paper carefully, could someone tell me what the draw is? It looks like it has the same asymptotic complexity as SGD in terms of sample size, per Table 1. Given that today's large, over-parameterized models have numerous comparable extrema, is there even a need for this? I wouldn't get out of bed unless it were sublinear.
This reminds me of simulated annealing, which I learned about in an AI class about a decade ago.<p><a href="https://en.wikipedia.org/wiki/Simulated_annealing" rel="nofollow">https://en.wikipedia.org/wiki/Simulated_annealing</a>
I don't get it. Gradient descent computation happens constantly, and the state/inputs change all the time, so you'd have to reset the heat landscape very frequently. What's the point? No way there's any potential speedup opportunity there, no?<p>If anything, you could probably do something with electromagnetic fields and their interference, possibly in 3D.