
Proving the Lottery Ticket Hypothesis: Pruning is All You Need

174 points · by che_shr_cat · about 5 years ago

10 comments

stared · about 5 years ago
The lottery ticket hypothesis is IMHO the single most interesting finding for deep learning. It explains why deep learning works (vs. shallow neural nets), why initial over-parametrization is often useful, why deeper is often better than shallower, etc.

For an overview I recommend:

- the original paper, "The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks", https://arxiv.org/abs/1803.03635

- "Deconstructing Lottery Tickets: Zeros, Signs, and the Supermask", https://eng.uber.com/deconstructing-lottery-tickets/, which shows that if we remove the "non-winning tickets" before training, the trained network still works well
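For readers who want to see what "finding a winning ticket" means procedurally, here is a minimal PyTorch sketch of the train → prune → rewind loop described in the original Frankle & Carbin paper. The toy model, random data, pruning fraction, and number of rounds are placeholders for illustration, not settings from the papers linked above.

```python
import copy
import torch
import torch.nn as nn

def train(model, data, targets, masks, epochs=5):
    """Ordinary training, except already-pruned weights are held at zero."""
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss_fn(model(data), targets).backward()
        opt.step()
        with torch.no_grad():
            for name, param in model.named_parameters():
                if name in masks:
                    param *= masks[name]

def find_winning_ticket(model, data, targets, rounds=3, prune_frac=0.2):
    init_state = copy.deepcopy(model.state_dict())  # remember the initialization theta_0
    masks = {n: torch.ones_like(p) for n, p in model.named_parameters() if p.dim() > 1}
    for _ in range(rounds):
        train(model, data, targets, masks)
        for name, param in model.named_parameters():
            if name not in masks:
                continue
            # prune the smallest-magnitude fraction of the weights that still survive
            alive = param.detach().abs()[masks[name].bool()]
            threshold = alive.quantile(prune_frac)
            masks[name] *= (param.detach().abs() >= threshold).float()
        # "rewind": reset surviving weights to their original initial values
        model.load_state_dict(init_state)
        with torch.no_grad():
            for name, param in model.named_parameters():
                if name in masks:
                    param *= masks[name]
    # the returned pair is the candidate winning ticket: the masked
    # initialization, ready to be retrained with the mask held fixed
    return model, masks

if __name__ == "__main__":
    torch.manual_seed(0)
    x, y = torch.randn(256, 20), torch.randint(0, 2, (256,))
    net = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
    ticket, masks = find_winning_ticket(net, x, y)
    kept = sum(m.sum().item() for m in masks.values())
    total = sum(m.numel() for m in masks.values())
    print(f"ticket keeps {kept / total:.1%} of the weights")
```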
xt00 · about 5 years ago
If "pruning is all you need", that does feel like a way of explaining how intelligence could come out of a mass of neurons such as our brain. Or at least it makes the idea understandable to me: add a bunch of connections relatively randomly, then start pruning slowly until you hit a point where the system changes... I'll keep hand-waving until somebody who knows this stuff can chime in. :)
IX-103 · about 5 years ago
This is really neat and has a lot of implications for porting larger models to limited platforms like mobile. Unfortunately you still have to train the larger network, so the gains are somewhat limited. Some other papers I read show that you might be able to prune the network in the middle of training, which would make larger models more practical to work with.
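Pruning in the middle of training is easy to experiment with today. Below is a hedged sketch using PyTorch's built-in `torch.nn.utils.prune` utilities, pruning 20% of the remaining weights by magnitude every few steps of training; the toy model, data, schedule, and amounts are illustrative choices, not taken from any of the papers the commenter mentions.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()
x, y = torch.randn(256, 20), torch.randint(0, 2, (256,))

for step in range(1, 301):
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    opt.step()
    # every 100 steps, prune another 20% of the remaining weights by L1 magnitude
    if step % 100 == 0:
        for module in model:
            if isinstance(module, nn.Linear):
                prune.l1_unstructured(module, name="weight", amount=0.2)

# fold the masks into the weights so the final model is plainly sparse
for module in model:
    if isinstance(module, nn.Linear):
        prune.remove(module, "weight")

zeros = sum((m.weight == 0).sum().item() for m in model if isinstance(m, nn.Linear))
total = sum(m.weight.numel() for m in model if isinstance(m, nn.Linear))
print(f"final weight sparsity: {zeros / total:.2%}")
```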
rubyn00bie · about 5 years ago
Am I understanding this right? Surely I must be missing the entire point, because...

This looks to me like: adding more and more bullshit to a model, while managing to increase its accuracy, eventually leads to a "smaller" model with less bullshit?

That is to say, adding correlated or endogenous variables to a model (over-parameterization), so long as it increases its accuracy, will one day yield a smaller, more optimized model with fewer variables?

If so, why is this news? Isn't this like the fundamental process of most statistics and optimization problems? Or isn't adding more data (when available) a fundamental way of dealing with multicollinearity?
bo1024 · about 5 years ago
I have a question. They show that any given depth-ℓ network, computing F, is w.h.p. approximated by some subnetwork of a random depth-2ℓ network.

But there is a theorem that even depth-2 networks can approximate *any* continuous function F. If the assumptions were the same, then their theorem would imply any continuous function F is w.h.p. approximated by some subnetwork of a depth-4 network.

So what is the difference in assumptions, i.e. what's the significance of F being computed by a depth-ℓ network? What functions can a depth-(ℓ+1) network approximate that a depth-ℓ network can't? I'd guess it has to do with Lipschitz assumptions and bounded parameters, but it would be awesome if someone could clarify!
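For context on the contrast being drawn here, a rough paraphrase of the two statements in LaTeX. Constants, norm bounds, and the exact width requirements are deliberately omitted; treat this as a reading aid based on the thread's own description, not the paper's precise theorem.

```latex
% Classical universal approximation: the weights are trained/chosen,
% and the required width can be very large.
\forall F \in C(K),\ \forall \varepsilon > 0:\quad
  \exists \text{ a depth-2 network } G \text{ with }
  \sup_{x \in K} |F(x) - G(x)| < \varepsilon .

% Lottery-ticket statement (as paraphrased in this thread): the weights are
% random and only a pruning mask is chosen, but F must itself be computable
% by a depth-$\ell$ network with bounded weights.
F \text{ computed by a bounded depth-}\ell \text{ network} \;\Longrightarrow\;
  \Pr\!\Big[\exists \text{ subnetwork } \tilde{G} \text{ of a sufficiently wide random
  depth-}2\ell \text{ network with } \sup_{x} |F(x) - \tilde{G}(x)| < \varepsilon\Big]
  \ge 1 - \delta .
```

So the depth-ℓ assumption is what replaces "arbitrary continuous F": the pruning result only has to hit targets that are themselves bounded networks, a much smaller class than all continuous functions.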
anonymousDan · about 5 years ago
As a potentially naive thought experiment: if you just generated in advance a number of random networks of similar size to the pruned lottery ticket, and then trained them all in parallel, would you eventually find a lottery ticket? If so, how many would you have to train to find a lottery ticket with high probability? Why is training one big network and then pruning any better than training lots of different smaller networks? Assume in all of the above that you have a rough idea of how big the pruned network will be.
tells · about 5 years ago
ELI5 someone please.
m0zg · about 5 years ago
So in other words, a sufficiently large set of monkeys with typewriters contains a subset which approximates the works of Shakespeare.
lonelappde · about 5 years ago
This paper formally proves what everyone already intuitively knows, right?

It's mathematically interesting, but not a practical advance.
zackmorris · about 5 years ago
I've always felt that there is a deep connection between evolution and thought, or more specifically, between genetic algorithms (GAs) and neural networks (NNs).

The state of the art when I started following AI in the late 90s was random weights and hyper-parameters chosen with a GA, then optimized with NN hill climbing to find the local maximum. It looks like the research has continued:

https://www.google.com/search?q=genetic+algorithm+neural+network

All I'm saying is that since we're no longer compute-bound, I'd like to see more big-picture thinking. We're so obsessed with getting 99% accuracy on some pattern-matching test that we completely miss other options, like in this case that effective subnetworks can evolve within a larger system of networks.

I'd like to see a mathematical proof showing that these and all other approaches to AI, like simulated annealing, are (or can be made) equivalent. Sort of like a Church–Turing thesis for machine learning:

https://en.wikipedia.org/wiki/Church–Turing_thesis

If we had this, then we could use higher-level abstractions and substitute simpler algorithms (like GAs) for the harder ones (like NNs) without getting so lost in the minutiae and terminology. Once we had working solutions, we could analyze them and work backwards to convert them to their optimized/complex NN equivalents.

An analogy for this would be solving problems in our heads with simpler/abstract methods like spreadsheets, functional programming and higher-order functions, then translating those solutions to whatever limited/verbose imperative languages we have to use for our jobs.

Edit: I should have said "NN gradient descent to find the local minimum", but hopefully my point still stands.

Edit 2: I should clarify that, in layman's terms, Church–Turing says "every effectively calculable function is a computable function", so functional programming and imperative programming can solve the same problems, be used interchangeably, and even be converted from one form to the other.
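As a toy illustration of the late-90s recipe described above, here is a hedged NumPy sketch: a genetic algorithm searches over random initializations (it could just as easily include hyper-parameters), and plain gradient descent then fine-tunes the best candidate it finds. The population size, mutation scale, and tiny regression task are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(128, 8))
true_w = rng.normal(size=8)
y = X @ true_w

def loss(w):
    # mean squared error of a linear model on the toy data
    return float(np.mean((X @ w - y) ** 2))

def gradient_descent(w, lr=0.05, steps=200):
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(X)
        w = w - lr * grad
    return w

# 1) GA phase: evolve a population of candidate initializations
population = [rng.normal(size=8) for _ in range(20)]
for generation in range(10):
    population.sort(key=loss)
    parents = population[:5]                                   # keep the fittest
    children = [p + rng.normal(scale=0.1, size=8) for p in parents for _ in range(3)]
    population = parents + children                            # next generation

# 2) Gradient-descent phase: fine-tune the best individual the GA found
best_init = min(population, key=loss)
final_w = gradient_descent(best_init)
print(f"GA-selected init loss: {loss(best_init):.4f}, after fine-tuning: {loss(final_w):.4f}")
```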