A year ago, a group of us at UMD took a couple of weeks to go through this paper. It has some interesting insights, but like all theory papers right now, gaps remain. The construction they present for learning Hamiltonians of low polynomial order doesn't look much like any common production neural network module, and their justification for why that Hamiltonian family is the only one we'd deal with in practice is unconvincing to me. That said, it's worth a close read overall. Section 2, parts A and B, is the best summary of the connection between probability theory and deep learning that I have come across.
Here's a related 2016 talk by Max Tegmark (second author) on connections between deep learning and physics:<p><a href="https://www.youtube.com/watch?v=5MdSE-N0bxs" rel="nofollow">https://www.youtube.com/watch?v=5MdSE-N0bxs</a><p>The gist of it is that physical data tends to have symmetries, and these symmetries make descriptions of the data very compressible into relatively small neural circuits. Random data does not have this property, and cannot be learned easily. Super fascinating.
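The compression point is easy to make concrete. Here's a toy counting sketch of my own (not from the talk or the paper): a Boolean function that is invariant under cyclic shifts of its input takes far fewer numbers to specify than an arbitrary one, because it only needs one value per shift-equivalence class.

```python
from itertools import product

n = 12

def canonical(bits):
    # The lexicographically smallest cyclic rotation represents the orbit.
    rotations = [bits[i:] + bits[:i] for i in range(len(bits))]
    return min(rotations)

all_inputs = list(product((0, 1), repeat=n))
orbits = {canonical(x) for x in all_inputs}
print(f"arbitrary function on {n} bits : {2 ** n} values to specify")
print(f"shift-invariant function       : {len(orbits)} values to specify")
```

Roughly a factor-of-n compression from one discrete symmetry; the talk's claim is that the symmetries of physical data (translation, rotation, locality) play the analogous role for real datasets, while random data gets no such discount.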
I really like this kind of cross-disciplinary research and knowledge transfer. It should happen more often.<p>It's interesting that so many equations governing laws in different fields share quite a few properties and, on deeper analysis, can be traced back to a single mathematical property. It makes me wonder how many insights we are missing simply because they were discovered in another field, under a different name, for a different purpose.
Learning a transformation from the full transformation semigroup is the most general case. Consider a mystery unary CPU operation M on a 64-bit register: how deep a circuit do you need to compute an arbitrary such M? By a Kolmogorov-style counting argument, just writing out a random transformation takes (2^64)*64 bits for the lookup table, and you need depth on the order of the log of that table to read a result out efficiently. These results were already proved in information theory: if you use depth less than the log of the lookup table, you are screwed unless your function is extremely non-random.
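For concreteness, a quick back-of-the-envelope in Python (my own arithmetic, assuming the plain lookup-table encoding described above):

```python
import math

n = 64
table_bits = (2 ** n) * n          # 2^64 entries of 64 bits each
print(f"lookup table: {table_bits} bits (~{table_bits / 8 / 2**60:.0f} EiB)")
print(f"log2(table) : {math.log2(table_bits):.0f}")
# With bounded fan-in, a depth-d circuit only reaches ~c^d gates, so a
# generic (incompressible) M forces depth on the order of this log; only
# highly structured, non-random M escapes the bound.
```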
> we show that n variables cannot be multiplied using fewer than 2^n neurons in a single hidden layer.<p>I don't know, but it just feels like this should have been known in CS earlier than 2016. Circuit complexity has been studied for a long time.
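The matching upper bound in the paper is fun too: as I read it, two variables can be multiplied to good approximation with just 4 hidden neurons, using any smooth activation whose second derivative at 0 is nonzero and a small input scale. A quick toy check of that construction (my own code, softplus chosen for convenience):

```python
import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))

SIGMA_PP0 = 0.25                     # softplus''(0) = sigmoid'(0) = 1/4

def approx_mul(x, y, lam=1e-2):
    # Four "hidden units": softplus evaluated at +-(u+v) and +-(u-v).
    u, v = lam * x, lam * y
    s = (softplus(u + v) + softplus(-u - v)
         - softplus(u - v) - softplus(-u + v))
    return s / (4 * SIGMA_PP0 * lam ** 2)

print(approx_mul(3.0, -2.5))         # ~ -7.5 (exact product is -7.5)
```

The odd Taylor terms cancel and the quadratic terms leave 4*sigma''(0)*uv, so the error shrinks as lam -> 0. The paper's point is that this trick needs 2^n such neurons for n-way products in one hidden layer, while a deep stack can multiply pairwise.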
This is a very nice paper that puts some "meaningful" names on what a neural network computes. I like how eq. 8 neatly describes a neural network as a readout of an energy Hamiltonian into an output distribution. They show that real-world data, which is described by low-order polynomial Hamiltonians, needs only a small number of units, and that "depth" gives the network its compositional/hierarchical ability. Even though some of the theory goes over my head, their main arguments seem to "fit" together very nicely.<p>So basically a deep network can "cheaply" (i.e. not fatally expensively) describe anything that occurs in nature, which is wonderful. I wonder, however, what will happen when we move to higher cognition and meta-cognition, which require the readout of network states that are not found in nature but are generated internally. It would be interesting to know whether we need much more brain or just a little more. In any case, a very interesting read.
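For anyone skimming, here is my reading of eq. 8 in miniature (my notation and toy numbers, not the authors' code): the output layer is a Boltzmann distribution over per-class "Hamiltonians" H_y(x), which is just a softmax of -H_y(x).

```python
import numpy as np

def boltzmann_readout(H):
    # H[y] is the "energy" H_y(x); lower energy => higher class probability.
    w = np.exp(-(H - H.min()))       # subtract the min for numerical stability
    return w / w.sum()

H = np.array([2.3, 0.1, 5.0])        # hypothetical energies for 3 classes
print(boltzmann_readout(H))          # identical to softmax(-H)
```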
Love to see more papers like this. I remember a recent paper that showed decent model performance when the model was allowed to pick its own activation functions. It was picking wacky things like sine waves. You'd like to learn something about the model by the way it configures itself, but right now we can only understand simple model features.
I am not a physicist (IANAP), so maybe I'm misremembering some definitions, but isn't the restriction to Hamiltonians a bit, well, restrictive? It limits the results to path-independent potentials, which covers basically nothing in the non-spherical-cow world. Are the authors working from a different definition of Hamiltonian?<p>Does it matter whether your Hamiltonian is smooth, or whether you are working from a discrete theory?
Same reason general relativity works: if you try to model results without a fundamental principle governing how the system's contents operate, you are going to run into limitations.
Large companies like Google and Facebook will easily be able to adapt to the change. They have an army of lawyers and the workforce for this.
A smaller startup? They will face challenges, especially the European ones.
Most likely, this "innovation" will end up as one more pop-up with an "Accept or Leave" message on every website you visit from Europe for the first time.
Here's an HTML version of the paper if you're on a phone: <a href="https://www.arxiv-vanity.com/papers/1608.08225/" rel="nofollow">https://www.arxiv-vanity.com/papers/1608.08225/</a>
It does not. It's about unsupervised learning and unlabeled data. It's the next frontier. Automated feature engineering for triangulated datasets is the real target.