A related paper I just found and am digesting: <a href="https://arxiv.org/abs/2012.04728" rel="nofollow">https://arxiv.org/abs/2012.04728</a><p>Softmax gives rise to translation symmetry, batch normalization to scale symmetry, homogeneous activations to rescale symmetry. Each of those induces its own invariants of the learning dynamics through training.
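To make the softmax case concrete (a minimal numpy sketch; the shift c is an arbitrary constant I chose for illustration):

    import numpy as np

    def softmax(x):
        # subtracting the max is the usual stability trick, and is itself
        # an application of the translation symmetry: softmax(x + c) == softmax(x)
        z = x - np.max(x)
        e = np.exp(z)
        return e / e.sum()

    x = np.array([1.0, 2.0, 3.0])
    c = 5.0
    print(np.allclose(softmax(x), softmax(x + c)))  # True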
This is one of those links where just seeing the title sets you off, thinking about the implications.<p>I'm going to have to spend more time digesting the article, but one thing that jumps out at me, and maybe it's answered in the article and I don't understand it, is the role of time. Generally in physics, you're talking about a quantity being conserved over time, and I'm not sure what plays the role of time when you're talking about conserved quantities in machine learning -- is it conserved over training iterations or over inference layers, or what?<p>edit: now that I've read it again, I see that they describe this in the second paragraph.<p>I'm now wondering, for something like Sora that can do a kind of physical modeling, whether there's some conserved quantity in the neural network that is _directly analogous_ to conserved quantities in physics -- whether there is, for example, something that represents momentum and operates exactly as momentum does as it progresses through the layers.
People have mentioned the discrete-continuous tradeoff. One way to bridge that gap would be to use <a href="https://arxiv.org/abs/1806.07366" rel="nofollow">https://arxiv.org/abs/1806.07366</a> - they draw an equivalence between residual networks (stacks of constant-width layers) and differential equations, and then use a differential equation solver in place of the layer stack to "train" a "neural net" (from what I remember - it's been years since that paper...).<p>Another approach might be to take an information-theoretic view with the infinite-width, finite-entropy nets.
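The key identity, as I remember it, is that a residual block x_{k+1} = x_k + f(x_k) is exactly one Euler step of dx/dt = f(x). A toy sketch (f here is just a stand-in for a learned layer, not anything from the paper):

    import numpy as np

    def f(x, t):
        # stand-in for a learned residual block
        return 0.1 * np.tanh(x)

    def resnet_forward(x, n_layers):
        # a stack of residual blocks
        for k in range(n_layers):
            x = x + f(x, k)
        return x

    def euler_forward(x, t0, t1, h):
        # the same computation viewed as an ODE integrator; shrinking h
        # (or handing the problem to an adaptive solver) gives the
        # continuous-depth limit
        t = t0
        while t < t1:
            x = x + h * f(x, t)
            t += h
        return x

    x0 = np.array([1.0, -0.5])
    print(resnet_forward(x0, 10))             # discrete network
    print(euler_forward(x0, 0.0, 10.0, 1.0))  # identical: Euler with h = 1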
I liked the article and I hope to understand it better with some study.<p>I think the following sentence in the article is wrong:
"Applying Noether's theorem gives us three conserved quantities—one for each degree of freedom in our group of transformations—which turn out to be horizontal, vertical, and angular momentum."<p>I think the correct statement is:
"Applying Noether's theorem gives us three conserved quantities—one for each degree of freedom in our group of transformations—which correspond to the symmetries of translation, rotation, and time shifting."<p>I think translation leads to conservation of momentum, rotation leads to conservation of angular momentum, and time shifting leads to conservation of energy (potential plus kinetic). It's been a few decades since I saw the proof, so I might be wrong.
So I think this is a great connection that deserves more thought, as well as an absolutely gorgeous write-up.<p>The main problem I see with it is that most of the time you <i>don't</i> want the exact optimum of your objective function, as that frequently results in overfitting. This leads to things like early stopping being typical.
I wonder if energy and work metrics could be derived for gradient descent. They might be useful for a more rigorous approach to hyperparameter development, and maybe for characterizing the data being learned. We say that some datasets are harder to learn, or we measure difficulty by the overall compute needed to hit a quality benchmark; something more essential would be a step forward.<p>For instance, in ANN backprop the gradient descent algorithm can use a momentum term to avoid getting stuck in local minima. That was only heuristically physical when I learned it... perhaps it's been developed further since. Maybe only allowing the momentum a "real" energy would align it with an ability-to-do-work calculation. It might also help with ensemble/Monte Carlo methods, by maintaining an energy account across the ensemble.
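Something like this bookkeeping is what I mean (a toy sketch; treating the loss as potential energy and the momentum buffer as a velocity with unit mass are my assumptions, not established conventions):

    import numpy as np

    def loss(w):   # "potential energy": a simple quadratic bowl
        return 0.5 * np.sum(w ** 2)

    def grad(w):   # gradient of the loss; the "force" is its negative
        return w

    w = np.array([3.0, -2.0])
    v = np.zeros_like(w)        # the momentum buffer, read as a velocity
    lr, mu = 0.1, 0.9           # step size and momentum coefficient

    for step in range(100):
        v = mu * v - lr * grad(w)        # heavy-ball update
        w = w + v
        kinetic = 0.5 * np.sum(v ** 2)   # with mass = 1
        total = kinetic + loss(w)        # candidate energy account; not
                                         # conserved, since mu < 1 damps it
    print(w, total)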
I need to digest this but it is a seductive idea. My quick take: there may be a connection between back-propagation and reversibility, both computational and physical. For a system to be reversible implies conservation of information.<p>It also makes me think about the surprising success of highly quantized models (see, for example, the recent paper on ternary networks, where the only valid weights are 0, 1, and -1).<p>Artificial Neural Networks were originally conceived as an approximation to an analog, continuous system, where floating-point numbers are stand-ins for reals. This is tied to the ability to back-prop, because the functions involved are differentiable over the reals. But if it turns out that we can closely approximate the same behavior with a small, discrete set of integers, the whole edifice feels more like some sort of Cellular Automaton with reversible rules rather than a set of functions over the reals.<p>Finally (sorry for the rabbit-holing): how does this relate to our brains? Note that real neurons "fire" -- that is, they generate a discrete event when their internal configuration reaches a triggering state.<p>Lots to chew on...
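For reference, the ternary scheme fits in a few lines; this is a generic absmean-style quantizer, my sketch rather than the exact recipe from any particular paper:

    import numpy as np

    def ternarize(W, eps=1e-8):
        # scale by the mean absolute weight, then round each entry to -1, 0, +1
        scale = np.mean(np.abs(W)) + eps
        Wq = np.clip(np.round(W / scale), -1, 1)
        return Wq, scale   # the forward pass uses Wq * scale in place of W

    W = np.random.randn(4, 4)
    Wq, s = ternarize(W)
    print(Wq)              # entries are only -1., 0., or 1.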
Very nice article! I recently had a long chat with ChatGPT on this topic, although from a slightly different perspective.<p>A neural network is a type of machine that solves nonlinear optimization problems, and the principle of least action is also a nonlinear optimization problem that nature solves by some kind of natural law.<p>This is the one thing ChatGPT mentioned that surprised me the most and that I had not previously considered:<p>> Eigenvalues of the Hamiltonian in quantum mechanics correspond to energy states. In neural networks, the eigenvalues (principal components) of certain matrices, like the weight matrices in certain layers, can provide information about the dominant features or patterns. The notion of states or dominant features might be loosely analogous between the two domains.<p>I am skeptical that any conserved quantity besides energy would have a corresponding conserved quantity in ML, and the Reynolds operator will likely be relevant for understanding any correspondence like this.<p>IIRC the Reynolds operator plays an important role in Noether's theorem, and it involves an averaging operation similar to what is described in the linked article.
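The analogy can at least be computed, for whatever it's worth (a sketch; whether the singular values of a weight matrix really play the role of an energy spectrum is exactly the speculative part):

    import numpy as np

    W = np.random.randn(64, 128)   # stand-in for a trained weight matrix
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    # the singular values s are the "spectrum"; the largest ones pick out the
    # dominant input/output directions, loosely analogous to dominant states
    print(s[:5])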
It has been shown that a finite-difference implementation of wave propagation can be expressed as a deep neural network (e.g., [1]). These networks can have thousands of layers, and yet I don't think they suffer from the exploding/vanishing gradient problem, which I imagine is because the physical system they model obeys conservation laws, such as conservation of energy.<p>[1] <a href="https://arxiv.org/abs/1801.07232" rel="nofollow">https://arxiv.org/abs/1801.07232</a>
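Concretely, one time step of the 1D scheme is a fixed residual-style update applied over and over (a minimal sketch with periodic boundaries, not the formulation in [1]):

    import numpy as np

    def step(u, u_prev, c2):
        # leapfrog update for the 1D wave equation, with c2 = (c*dt/dx)^2:
        # u_next = 2u - u_prev + c2 * (discrete Laplacian of u)
        lap = np.roll(u, 1) + np.roll(u, -1) - 2 * u
        return 2 * u - u_prev + c2 * lap

    n = 200
    u_prev = np.exp(-np.linspace(-5.0, 5.0, n) ** 2)  # Gaussian pulse
    u = u_prev.copy()                                 # zero initial velocity
    for _ in range(1000):                   # each iteration is one "layer"
        u, u_prev = step(u, u_prev, 0.25), u   # stable: (c*dt/dx)^2 <= 1
    print(np.sum(u ** 2))                   # signal neither explodes nor vanishes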
As a complete amateur I was wondering if it would be possible to use that property of light (always taking the path of least time) to solve the traveling salesman problem (and, as a consequence, the whole class of such problems). Maybe not with an algorithmic approach, but rather some smart physical implementation of the machine itself.
See also "Noether Networks: Meta-Learning Useful Conserved Quantities" <a href="https://arxiv.org/abs/2112.03321" rel="nofollow">https://arxiv.org/abs/2112.03321</a> from 2021.<p>Abstract: Progress in machine learning (ML) stems from a combination of data availability, computational resources, and an appropriate encoding of inductive biases. Useful biases often exploit symmetries in the prediction problem, such as convolutional networks relying on translation equivariance. Automatically discovering these useful symmetries holds the potential to greatly improve the performance of ML systems, but still remains a challenge. In this work, we focus on sequential prediction problems and take inspiration from Noether's theorem to reduce the problem of finding inductive biases to meta-learning useful conserved quantities. We propose Noether Networks: a new type of architecture where a meta-learned conservation loss is optimized inside the prediction function. We show, theoretically and experimentally, that Noether Networks improve prediction quality, providing a general framework for discovering inductive biases in sequential problems.
How do you direct what the network learns if it all comes from supervised training sets?<p>How do you insert rules that aren't learned into the weights that are?