There are a couple of factual errors here. First, the difference between backprop and evolution is smaller than the author indicates. The error signal used in modern backprop training is stochastic because it is computed on a minibatch (which is why it's called stochastic gradient descent). This stochasticity seems important to achieving good results. And the most popular evolutionary algorithm in the deep learning world is Evolution Strategies, which effectively approximates a gradient. Ordinary genetic algorithms are not gradient-based and have recently shown promise in limited domains, but they can't compete with gradient-based algorithms for supervised learning.

The key claim in the article, that gradient descent could not discover physics equations from data, seems like a statement about neural networks, not gradient descent. Given sufficient training data, a neural network can probably learn to model physics. I sympathize with the concern that it's very difficult to translate a neural network's knowledge into human concepts, but I see no reason to believe that optimizing the same system with an evolutionary algorithm would make this problem any easier. You could, e.g., try to do program induction (which was supposed to be the future of AI many decades ago) instead of modeling the data directly, but choosing to perform program induction does not preclude the use of a neural network. Neural networks trained by gradient descent can generate ASTs (e.g. http://nlp.cs.berkeley.edu/pubs/Rabinovich-Stern-Klein_2017_AbstractSyntaxNetworks_paper.pdf).

[Edited to remove reference to universal approximation; as comments point out, even if a neural network can approximate a function, it isn't guaranteed to be able to learn it. But I am reasonably confident that a neural network can learn Newton's second law.]
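To make the "ES approximates a gradient" point concrete, here's a toy sketch of the vanilla score-weighted ES estimator (my own minimal example, not from the article): the average of loss-weighted Gaussian perturbations converges to a gradient estimate, so the update looks just like noisy gradient descent.

    import numpy as np

    def f(theta):
        # Toy objective standing in for a training loss.
        return np.sum(theta ** 2)

    rng = np.random.default_rng(0)
    theta = rng.normal(size=10)
    sigma, lr, pop = 0.1, 0.05, 100

    for step in range(300):
        eps = rng.normal(size=(pop, theta.size))      # Gaussian "mutations"
        losses = np.array([f(theta + sigma * e) for e in eps])
        advantages = losses - losses.mean()           # baseline cuts variance
        grad_est = (eps * advantages[:, None]).mean(axis=0) / sigma
        theta -= lr * grad_est                        # descend the estimated gradient

    print(f(theta))  # should end up near 0: ES acted like noisy gradient descent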
I'm optimistic about the potential for evolutionary algorithms. I've used both EAs and gradient descent in developing robot controllers.

But the argument here about why gradient descent won't be able to learn certain things is weak. Thought experiments are not a reliable guide to what GD can or can't do.

It's fair enough to say that F=ma and E=mc² aren't in the data. Indeed, it took thousands of years of human thought to arrive at them. So the argument "it's not clear how an algorithm could extract F=ma from the data" isn't a strong criticism, because humans also can't do it by induction alone.

The long process culminating in F=ma involved a lot of abstract symbolic thought. Whether human-level abstract symbolic thought can be learned through GD (probably in combination with some sort of Monte Carlo tree search) is an open question. It can only be answered by trying to build things and seeing if they work.

If you want to make an argument about the limits of GD and induction, it'd be better to compare against a problem humans can solve reliably, rather than an insight that one genius had after decades of thought while standing on the shoulders of other geniuses.
The points the author makes about gradient descent are accurate, in a sense. However, they oversimplify the technique (as it is currently applied) and the context in which it is used. It seems as if the author, like many others, understands the basic mechanisms but not the context in which experts apply them.

The example the author cites regarding evo algorithms learning physical laws is laughable: "It's just not in the data - it has to be invented" applies equally to backprop and to evolutionary learning algorithms.

"In this case, the representation (mathematical expressions represented as trees) is distinctly non-differentiable, so could not even in principle be learned through gradient descent."

This is incorrect, almost like saying NLP data is not differentiable. For instance, set this representation up as the output of a network (or, if you wanted to be fancier, the central component of an autoencoder), and see how well it predicts/correlates with the experimental data. That discrepancy is the error, which is back-propagated through the network's nodes.

FWIW, many theoreticians believe that the unreasonable effectiveness of neural networks, and especially transfer learning, *is a result* of how well suited they are to encoding laws of physics and Euclidean geometry. The author's final points rest on a nine-year-old survey and may be out of date w.r.t. contemporary neural networks, which often have spookily good local minima and do not behave the way intuition about gradient descent might suggest.
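As a deliberately trivial version of this point, here's a sketch (mine, not the article's, and it sidesteps the symbolic-representation question entirely) of gradient descent recovering F=ma as a function from raw (mass, acceleration, force) samples, using a tiny tanh network with hand-written backprop:

    import numpy as np

    # Synthetic "experiments": masses and accelerations with F = m * a.
    rng = np.random.default_rng(0)
    m_a = rng.uniform(0.0, 1.0, size=(256, 2))    # columns: mass, acceleration
    force = (m_a[:, 0] * m_a[:, 1])[:, None]      # Newton's second law

    # Tiny tanh MLP trained by full-batch gradient descent.
    W1 = rng.normal(scale=0.5, size=(2, 32)); b1 = np.zeros(32)
    W2 = rng.normal(scale=0.5, size=(32, 1)); b2 = np.zeros(1)
    lr = 0.1

    for step in range(5000):
        h = np.tanh(m_a @ W1 + b1)                # hidden layer
        pred = h @ W2 + b2                        # predicted force
        err = pred - force
        loss = np.mean(err ** 2)
        # Manual backprop of the MSE loss.
        d_pred = 2 * err / len(m_a)
        dW2 = h.T @ d_pred; db2 = d_pred.sum(axis=0)
        d_h = (d_pred @ W2.T) * (1 - h ** 2)
        dW1 = m_a.T @ d_h; db1 = d_h.sum(axis=0)
        for p, g in ((W1, dW1), (b1, db1), (W2, dW2), (b2, db2)):
            p -= lr * g

    print(loss)  # should fall orders of magnitude below the initial error

It doesn't hand you the symbol string "F=ma", which is the interpretability complaint, but the law itself is plainly learnable from the data by gradient descent.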
In my experience, if you have even a little smoothness in your problem's cost manifold, taking advantage of gradients is invaluable for sample efficiency. Many losses that don't seem differentiable can be reformulated so that they are - look around and you'll see a wide array of algorithms being put into end-to-end learned frameworks. If the dimensionality is small, second-order methods (or approximations thereof) can do dramatically better still. That said, I'm also a fan of evolutionary algorithms, and I see no reason why evolutionary rules can't be defined with awareness of gradient signals.
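A minimal sketch of the reformulation trick (my own toy example): 0-1 classification error is a step function with zero gradient almost everywhere, but relaxing the hard step to a sigmoid of the margin gives a smooth surrogate you can descend, and the non-differentiable loss falls along the way.

    import numpy as np

    def zero_one_loss(w, X, y):
        # Non-differentiable: piecewise constant in w.
        return np.mean(np.sign(X @ w) != y)

    def surrogate_grad(w, X, y, temp=5.0):
        # Analytic gradient of the smooth surrogate mean(sigmoid(-temp * margin)).
        s = 1.0 / (1.0 + np.exp(temp * y * (X @ w)))
        return X.T @ (-temp * y * s * (1 - s)) / len(y)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    y = np.sign(X @ np.array([1.0, -2.0]))   # linearly separable labels
    w = rng.normal(size=2)
    for _ in range(500):
        w -= 0.5 * surrogate_grad(w, X, y)   # descend the smooth surrogate
    print(zero_one_loss(w, X, y))            # should be driven toward 0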
Once again, there is a general conflation of evolutionary algorithms and learning without a differentiable error function. I've made this argument before in a discussion with antirez here on HN [1], but the crux of it is that Reinforcement Learning as it stands is optimization over (maybe-)non-differentiable errors. For AlphaGo there is no gradient, per se, that says you will win more if you move this way (granted, it is optimized by training towards the win-rate "score", which could be treated as an error score) - look at REINFORCE for other variations. Evolutionary Learning and Reinforcement Learning are two sides of the same coin.

[1] https://news.ycombinator.com/item?id=16652138
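For the curious, here's a minimal REINFORCE sketch (my own toy bandit, nothing to do with AlphaGo's internals): the reward is a black box that is never differentiated; only the log-probability of the sampled action is, which is exactly the "optimize a non-differentiable score" trick.

    import numpy as np

    rng = np.random.default_rng(0)
    true_rewards = np.array([0.2, 0.8, 0.5])  # black-box scores, never differentiated
    logits = np.zeros(3)                      # policy parameters
    lr, baseline = 0.1, 0.0

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    for step in range(2000):
        probs = softmax(logits)
        action = rng.choice(3, p=probs)
        reward = true_rewards[action] + rng.normal(scale=0.1)  # noisy score
        # Gradient of log pi(action) w.r.t. the logits: one-hot minus probs.
        grad_logp = -probs
        grad_logp[action] += 1.0
        # REINFORCE update: only log pi is differentiated, never the reward.
        logits += lr * (reward - baseline) * grad_logp
        baseline += 0.05 * (reward - baseline)  # running average as a simple baseline

    print(softmax(logits))  # mass should concentrate on the best arm (index 1)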
(This is of course my opinion, not scientific fact.) The biggest argument for EAs is that a proper implementation does not get stuck in local minima. GD and SGD have that tendency - if not in one spot, then probably looping between many. The problem with EAs, on the other hand, is that their mathematical model is based on probability, which is quite tricky to reason about even for an above-average programmer.
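A toy illustration of the local-minima point (my own construction, one-dimensional Rastrigin): plain gradient descent stalls in whichever basin it starts in, while even a crude (1+1)-style mutation loop can hop between basins.

    import numpy as np

    def f(x):
        # Multimodal: global minimum at x = 0, local minima near every integer.
        return x**2 + 10 * (1 - np.cos(2 * np.pi * x))

    def grad_f(x):
        return 2 * x + 20 * np.pi * np.sin(2 * np.pi * x)

    rng = np.random.default_rng(0)

    # Gradient descent from a bad start: converges to the nearest local minimum.
    x = 3.1
    for _ in range(1000):
        x -= 0.001 * grad_f(x)
    print(f(x))   # stuck in the basin near x = 3

    # (1+1)-style evolution: keep a mutation only if it improves fitness.
    x = 3.1
    for _ in range(1000):
        cand = x + rng.normal(scale=1.0)  # mutations large enough to cross basins
        if f(cand) < f(x):
            x = cand
    print(f(x))   # typically finds the global basin near x = 0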
I don't see why stochastic gradient descent has to be worse than evolutionary algorithms. In high-dimensional spaces, there is a large space of possible "mutations". SGD just biases certain mutations based on the gradient from a minibatch. That sounds a lot like evaluating the fitness of a mutated population on a minibatch and culling the members with low fitness. In fact, there are many demonstrations that the stochastic nature of SGD (coming from the use of minibatches) is crucial for effective learning.
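To spell out the analogy, here's a deliberately naive sketch (mine): mutate a population of parameter vectors, score each on a fresh random minibatch, keep the fittest. Squint and it's a gradient-free cousin of an SGD step.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))
    y = X @ rng.normal(size=5)                  # toy regression task

    population = rng.normal(size=(20, 5))       # 20 candidate weight vectors
    for gen in range(300):
        idx = rng.choice(len(X), size=32)       # a fresh minibatch per generation
        Xb, yb = X[idx], y[idx]
        # Mutate: each child is a parent plus Gaussian noise.
        children = population + rng.normal(scale=0.1, size=population.shape)
        pool = np.vstack([population, children])
        # Fitness on the minibatch only, just like an SGD step's loss estimate.
        fitness = ((pool @ Xb.T - yb) ** 2).mean(axis=1)
        population = pool[np.argsort(fitness)[:20]]  # cull the unfit half

    best = population[0]
    print(((X @ best - y) ** 2).mean())         # full-data loss, should be near 0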
Notice how black swans are a labeling/categorization problem, whereas F=ma is a causal relationship. The first has no induction problem, because there is no swan-ness except in our minds. The second has no induction problem, because you model causal relationships and predict from them, and better predictions mean being less wrong. The problem of induction is only one of "ultimate reality".
Meh, weird article. None of the nice modern results (the ImageNet family, etc.) were achieved through gradient descent alone - this article, like most, seems to be missing the forest for the trees with deep learning.

It's not about the network architecture or gradient descent on their own - it's the interaction, the dynamical system over weight space that training is.

Behind every great modern deep learning result? An enormous hyperparameter search and lots of elbow grease to carefully tune that dynamical system juuuuust right so the weight particle ends up in just the right place when training finishes. Smells like evolution to me. DeepMind even formalized the evolutionary process a deep learning researcher runs manually when fine-tuning a model into Population Based Training: https://deepmind.com/blog/population-based-training-neural-networks/
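For flavor, here's a stripped-down sketch of the PBT exploit/explore loop (my own reading of the blog post, emphatically not DeepMind's code): underperformers copy a stronger member's weights ("exploit") and perturb its hyperparameters ("explore") while everyone keeps training.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 5))
    y = X @ rng.normal(size=5)                   # toy regression task

    # Population: each member carries weights and its own learning rate.
    members = [{"w": rng.normal(size=5), "lr": 10 ** rng.uniform(-4, -1)}
               for _ in range(8)]

    def train_step(m):
        grad = 2 * X.T @ (X @ m["w"] - y) / len(X)   # gradient of MSE
        m["w"] = m["w"] - m["lr"] * grad

    def score(m):
        return ((X @ m["w"] - y) ** 2).mean()

    for epoch in range(50):
        for m in members:
            for _ in range(10):                      # everyone keeps training
                train_step(m)
        members.sort(key=score)                      # best first
        for loser, winner in zip(members[4:], members[:4]):
            loser["w"] = winner["w"].copy()          # exploit: copy better weights
            loser["lr"] = min(0.3, winner["lr"] * rng.choice([0.8, 1.25]))  # explore

    print(score(members[0]))  # best member's loss after training plus evolution

The selection-and-perturbation loop over hyperparameters is exactly the manual grad-student-descent process, automated.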