The lottery ticket hypothesis is IMHO the single most interesting finding in deep learning. It explains why deep learning works (vs shallow neural nets), why initial over-parametrization is often useful, why deeper is often better than shallow, etc.<p>For an overview I recommend:<p>- the original paper, "The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks", <a href="https://arxiv.org/abs/1803.03635" rel="nofollow">https://arxiv.org/abs/1803.03635</a><p>- "Deconstructing Lottery Tickets: Zeros, Signs, and the Supermask", <a href="https://eng.uber.com/deconstructing-lottery-tickets/" rel="nofollow">https://eng.uber.com/deconstructing-lottery-tickets/</a>, showing that if we remove the "non-winning tickets" before training, the trained network still works well
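For concreteness, here is a minimal sketch of the iterative train → prune → rewind loop from the original paper. The model, the data and the train_fn callback are placeholders of my own, and the 20% per-round pruning rate is just an example:

    import copy
    import torch

    def find_lottery_ticket(model, train_fn, prune_fraction=0.2, rounds=5):
        # Keep the original initialisation so surviving weights can be rewound to it.
        init_state = copy.deepcopy(model.state_dict())
        # One binary mask per weight matrix (biases are left unpruned here).
        masks = {n: torch.ones_like(p) for n, p in model.named_parameters() if p.dim() > 1}

        for _ in range(rounds):
            train_fn(model, masks)  # ordinary training, with the masks applied to the weights

            with torch.no_grad():
                # Prune the smallest-magnitude surviving weights in each layer.
                for name, param in model.named_parameters():
                    if name not in masks:
                        continue
                    alive = param[masks[name].bool()].abs()
                    if alive.numel() == 0:
                        continue
                    threshold = alive.quantile(prune_fraction)
                    masks[name] *= (param.abs() > threshold).float()

                # Rewind the survivors to their original initialisation.
                model.load_state_dict(init_state)
                for name, param in model.named_parameters():
                    if name in masks:
                        param *= masks[name]
        return model, masks

The surprising empirical result is that the pruned-and-rewound subnetwork, trained on its own, reaches roughly the accuracy of the full network.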
If “pruning is all you need”, that does feel like a way of explaining how intelligence could come out of a mass of neurons such as our brain. Or at least that framing makes it understandable to me. Basically: add a bunch of connections relatively randomly, then start pruning slowly until you hit a point where the system changes... I’ll keep hand-waving until somebody who knows this stuff can chime in. :)
This is really neat and has a lot of implications for porting larger models to limited platforms like mobile. Unfortunately you still have to train the larger network, so the gains are somewhat limited. Some other papers I read show that you might be able to prune the network in the middle of training, which would make larger models more practical to work with.
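Roughly, the "prune during training" variants ramp up a sparsity target while the weights are still being optimised, instead of doing a second full training run. A toy sketch of such a schedule (the cubic ramp and the 90% final sparsity are illustrative choices of mine, not taken from a specific paper):

    def target_sparsity(step, start_step, end_step, final_sparsity=0.9):
        """Fraction of weights that should be pruned away by `step`."""
        if step <= start_step:
            return 0.0
        if step >= end_step:
            return final_sparsity
        progress = (step - start_step) / (end_step - start_step)
        return final_sparsity * (1.0 - (1.0 - progress) ** 3)  # ramps up quickly, then levels off

    # In the training loop, every few hundred steps you would zero out the
    # smallest-magnitude weights until target_sparsity(step) is reached, then
    # keep training the surviving weights under that mask.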
Am I understanding this right? Surely I must be missing the entire point, because...<p>This looks to me like: adding more and more bullshit to a model, as long as it manages to increase its accuracy, eventually leads to a "smaller" model with less bullshit?<p>That is to say, adding correlated or endogenous variables to a model (over-parameterization), so long as it increases its accuracy, will one day yield a smaller, more optimized model with fewer variables?<p>If so, why is this news? Isn't this the fundamental process of most statistics and optimization problems? Or isn't adding more data (when available) a fundamental way of dealing with multicollinearity?
I have a question. They show that any given depth-ℓ network computing F is w.h.p. approximated by some subnetwork of a random depth-2ℓ network.<p>But there is a theorem that even depth-2 networks can approximate <i>any</i> continuous function F. If the assumptions were the same, then their theorem would imply that any continuous function F is w.h.p. approximated by some subnetwork of a depth-4 network.<p>So what is the difference in assumptions, i.e. what’s the significance of F being computed by a depth-ℓ network? What functions can a depth-(ℓ+1) network approximate that a depth-ℓ network can’t? I’d guess it has to do with Lipschitz assumptions and bounded parameters, but it would be awesome if someone could clarify!
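For reference, here is roughly the shape of the two statements I’m comparing (I’m paraphrasing the assumptions from memory, so the exact conditions may be off):

    % Universal approximation (depth 2): for any continuous F on a compact set K
    % and any eps > 0 there exists SOME one-hidden-layer network G (with no
    % bound on its width) such that
    \sup_{x \in K} |F(x) - G(x)| \le \varepsilon

    % The subnetwork result, as I read it: if F is itself a depth-\ell ReLU
    % network of width n with norm-bounded weights, then a RANDOM depth-2\ell
    % network whose width is only polynomial in n, \ell and 1/\varepsilon
    % contains, w.h.p., a subnetwork G with
    \sup_{\|x\| \le 1} |F(x) - G(x)| \le \varepsilon

If that reading is right, the depth-2 theorem buys arbitrary continuous targets at the cost of potentially enormous width, whereas the subnetwork result only has to hit targets that are themselves bounded-weight depth-ℓ networks, which is what keeps the random network’s width polynomial.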
As a potentially naive thought experiment: if you generated in advance a number of random networks of similar size to the pruned lottery ticket, and then trained them all in parallel, would you eventually find a lottery ticket? If so, how many would you have to train to find one with high probability? Why is training one big network and then pruning any better than training lots of different smaller networks? Assume in all of the above that you have a rough idea of how big the pruned network will be.
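One back-of-the-envelope observation (my own numbers, not from the paper): a single over-parameterised network implicitly contains combinatorially many candidate subnetworks, far more than you could ever sample and train separately:

    from math import comb

    big_layer_weights = 1000   # weights in one over-parameterised layer
    ticket_weights = 100       # weights kept after pruning to 10%

    candidates = comb(big_layer_weights, ticket_weights)
    print(f"about 10^{len(str(candidates)) - 1} possible sparse masks for one layer")  # ~10^139

So training the big network and pruning is, loosely, a search over all of those masks at once, while pre-sampling small networks only ever tries the handful you happened to draw.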
I've always felt that there is a deep connection between evolution and thought, or more specifically, between genetic algorithms (GAs) and neural networks (NNs).<p>The state of the art when I started following AI in the late 90s was random weights and hyper-parameters chosen with a GA, then optimized with NN hill climbing to find the local maximum. It looks like the research has continued:<p><a href="https://www.google.com/search?q=genetic+algorithm+neural+network" rel="nofollow">https://www.google.com/search?q=genetic+algorithm+neural+net...</a><p>All I'm saying is that since we're no longer compute-bound, I'd like to see more big-picture thinking. We're so obsessed with getting 99% accuracy on some pattern-matching test that we completely miss other options, like in this case that effective subnetworks can evolve within a larger system of networks.<p>I'd like to see a mathematical proof showing that these and all other approaches to AI, like simulated annealing, are (or can be made) equivalent. Sort of like a Church–Turing thesis for machine learning:<p><a href="https://en.wikipedia.org/wiki/Church–Turing_thesis" rel="nofollow">https://en.wikipedia.org/wiki/Church–Turing_thesis</a><p>If we had this, then we could use higher-level abstractions and substitute simpler algorithms (like GAs) for the harder ones (like NNs) without getting so lost in the minutiae and terminology. Once we had working solutions, we could analyze them and work backwards to convert them into their optimized/complex NN equivalents.<p>An analogy for this would be solving problems in our heads with simpler/abstract methods like spreadsheets, functional programming and higher-order functions, then translating those solutions to whatever limited/verbose imperative languages we have to use for our jobs.<p>Edit: I should have said "NN gradient descent to find the local minimum", but hopefully my point still stands.<p>Edit 2: I should clarify that in layman's terms, Church–Turing says "every effectively calculable function is a computable function", so functional programming and imperative programming can solve the same problems, be used interchangeably and even be converted from one form to the other.
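For what it's worth, here's a toy version of that late-90s recipe on a made-up loss function (mutation-only for brevity, so strictly it's an evolutionary strategy rather than a GA with crossover):

    import numpy as np

    rng = np.random.default_rng(0)

    def loss(w):  # stand-in for a network's training loss
        return np.sum((w - 3.0) ** 2) + np.sin(5 * w).sum()

    def grad(w):  # its gradient
        return 2 * (w - 3.0) + 5 * np.cos(5 * w)

    # 1. Evolutionary phase: evolve a population of candidate weight vectors.
    population = rng.normal(size=(20, 4))
    for _ in range(30):
        fitness = np.array([loss(w) for w in population])
        parents = population[np.argsort(fitness)[:10]]  # keep the fittest half
        children = parents[rng.integers(0, 10, size=10)] + 0.1 * rng.normal(size=(10, 4))  # mutate
        population = np.vstack([parents, children])

    # 2. Gradient-descent phase: refine the best candidate to a local minimum.
    w = population[np.argmin([loss(p) for p in population])].copy()
    for _ in range(200):
        w -= 0.01 * grad(w)

    print("final loss:", loss(w))

The evolutionary phase scatters candidates across basins of the loss surface; gradient descent then only has to find the bottom of the best basin it was handed.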