A year ago, a group of us at UMD took a couple of weeks to go through this paper. It has some interesting insights, but like all theory papers right now, gaps remain. The construction they present for learning Hamiltonians of low polynomial order doesn't look much like any common production neural network module, and their justification for why that Hamiltonian family is the only one we'd deal with in practice is unconvincing to me. That said, it's worth a close read overall. Section 2, parts A and B, is the best summary of the connection between probability theory and deep learning that I have come across.
Here's a related 2016 talk by Max Tegmark (second author) on connections between deep learning and physics:<p><a href="https://www.youtube.com/watch?v=5MdSE-N0bxs" rel="nofollow">https://www.youtube.com/watch?v=5MdSE-N0bxs</a><p>The gist of it is that physical data tends to have symmetries, and these symmetries make descriptions of the data very compressible into relatively small neural circuits. Random data does not have this property, and cannot be learned easily. Super fascinating.
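The compression point is easy to make concrete. Here's a toy counting sketch of my own (not from the talk or the paper): a Boolean function that is invariant under cyclic shifts of its input takes far fewer numbers to specify than an arbitrary one, because it only needs one value per shift-equivalence class.

```python
from itertools import product

n = 12

def canonical(bits):
    # The lexicographically smallest cyclic rotation represents the orbit.
    rotations = [bits[i:] + bits[:i] for i in range(len(bits))]
    return min(rotations)

all_inputs = list(product((0, 1), repeat=n))
orbits = {canonical(x) for x in all_inputs}
print(f"arbitrary function on {n} bits : {2 ** n} values to specify")
print(f"shift-invariant function       : {len(orbits)} values to specify")
```

Roughly a factor-of-n compression from one discrete symmetry; the talk's claim is that the symmetries of physical data (translation, rotation, locality) play the analogous role for real datasets, while random data gets no such discount.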
I really like this kind of cross-disciplinary research and knowledge transfer. It should happen more often.<p>It's interesting that so many equations governing laws in different fields share quite a few properties and, on deeper analysis, can be traced back to a single mathematical property. It makes me wonder how many insights we are missing simply because they were discovered in another field, under a different name, for a different purpose.
Learning a transformation from the full transformation semigroup is the most general case. Consider a mystery unary CPU operation M on a 64-bit register: how deep a circuit do you need to compute an arbitrary such M? By a Kolmogorov-style counting argument, just writing out a random transformation takes (2^64)*64 bits for the lookup table, and you need depth on the order of the log of that table to read a result out efficiently. These results were already proved in information theory: if you use depth less than the log of the lookup table, you are screwed unless your function is extremely non-random.
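For concreteness, a quick back-of-the-envelope in Python (my own arithmetic, assuming the plain lookup-table encoding described above):

```python
import math

n = 64
table_bits = (2 ** n) * n          # 2^64 entries of 64 bits each
print(f"lookup table: {table_bits} bits (~{table_bits / 8 / 2**60:.0f} EiB)")
print(f"log2(table) : {math.log2(table_bits):.0f}")
# With bounded fan-in, a depth-d circuit only reaches ~c^d gates, so a
# generic (incompressible) M forces depth on the order of this log; only
# highly structured, non-random M escapes the bound.
```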
> we show that n variables cannot be multiplied using fewer than 2^n neurons in a single hidden layer.<p>I don't know, but it just feels like this should have been known in CS earlier than 2016. Circuit complexity has been studied for a long time.
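The matching upper bound in the paper is fun too: as I read it, two variables can be multiplied to good approximation with just 4 hidden neurons, using any smooth activation whose second derivative at 0 is nonzero and a small input scale. A quick toy check of that construction (my own code, softplus chosen for convenience):

```python
import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))

SIGMA_PP0 = 0.25                     # softplus''(0) = sigmoid'(0) = 1/4

def approx_mul(x, y, lam=1e-2):
    # Four "hidden units": softplus evaluated at +-(u+v) and +-(u-v).
    u, v = lam * x, lam * y
    s = (softplus(u + v) + softplus(-u - v)
         - softplus(u - v) - softplus(-u + v))
    return s / (4 * SIGMA_PP0 * lam ** 2)

print(approx_mul(3.0, -2.5))         # ~ -7.5 (exact product is -7.5)
```

The odd Taylor terms cancel and the quadratic terms leave 4*sigma''(0)*uv, so the error shrinks as lam -> 0. The paper's point is that this trick needs 2^n such neurons for n-way products in one hidden layer, while a deep stack can multiply pairwise.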
This is a very nice paper that puts some "meaningful" names on what a neural network computes. I like how eq. 8 neatly describes a neural network as a readout of an energy Hamiltonian into an output distribution. They show that real-world data, which is described by low-order polynomial Hamiltonians, needs only a small number of units, and that "depth" gives the network its compositional/hierarchical ability. Even though some of the theory goes over my head, their main arguments seem to "fit" together very nicely.<p>So basically a deep network can "cheaply" (i.e. not fatally expensively) describe anything that occurs in nature, which is wonderful. I wonder, however, what will happen when we move to higher cognition and meta-cognition, which require the readout of network states that are not found in nature but are generated internally. It would be interesting to know whether we need much more brain or just a little more. In any case, a very interesting read.
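For anyone skimming, here is my reading of eq. 8 in miniature (my notation and toy numbers, not the authors' code): the output layer is a Boltzmann distribution over per-class "Hamiltonians" H_y(x), which is just a softmax of -H_y(x).

```python
import numpy as np

def boltzmann_readout(H):
    # H[y] is the "energy" H_y(x); lower energy => higher class probability.
    w = np.exp(-(H - H.min()))       # subtract the min for numerical stability
    return w / w.sum()

H = np.array([2.3, 0.1, 5.0])        # hypothetical energies for 3 classes
print(boltzmann_readout(H))          # identical to softmax(-H)
```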
Love to see more papers like this. I remember a recent paper that showed decent model performance when the model was allowed to pick its own activation functions. It was picking wacky things like sine waves. You'd like to learn something about the model by the way it configures itself, but right now we can only understand simple model features.
I am not a physicist (IANAP), so maybe I'm misremembering some definitions, but isn't the restriction to Hamiltonians a bit, well, restrictive? It limits the results to path-independent potentials, which covers basically nothing in the non-spherical-cow world. Are the authors working from a different definition of Hamiltonian?<p>Does it matter whether your Hamiltonian is smooth, or whether you are working from a discrete theory?
Same reason general relativity works: if you try to model results without a fundamental principle governing how the system's contents operate, you are going to run into limitations.
Large companies like Google and Facebook will easily be able to adapt to the change. They have an army of lawyers and the workforce for this.
A smaller startup? They will face challenges, especially the European ones.
Most likely, this "innovation" will end up as one more pop-up with an "Accept or Leave" message on every website you visit from Europe for the first time.
Here's an HTML version of the paper if you're on a phone: <a href="https://www.arxiv-vanity.com/papers/1608.08225/" rel="nofollow">https://www.arxiv-vanity.com/papers/1608.08225/</a>
It does not. It's about unsupervised learning and unlabeled data. It's the next frontier. Automated feature engineering for triangulated datasets is the real target.