I wasn't sure if this paper was a parody on reading the abstract. It's not. Two things stand out to me: the first is the idea of distilling these networks down into a smaller latent space and then mucking around with that. That's interesting, and it cuts across a bunch of interesting topics like interpretability, compression, training, over- and under-.. The second is that they show the diffusion models don't just converge on parameters similar to the ones they train against/diffuse into, and that's also interesting.<p>I confess I'm not sure what I'd do with this in the random grab bag of Deep Learning knowledge I have, but I think it's pretty fascinating. I might like to see a trained latent encoder that works well on a bunch of different neural networks; maybe that thing would be a good tool for interpreting / inspecting them.
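To make that concrete, here's a rough toy sketch of how I picture the setup (my own reconstruction, not the paper's code; the names like ParamAutoencoder and all the sizes are made up): flatten checkpoint weights, compress them with a small autoencoder, and train a denoiser over the latent codes.

```python
import torch
import torch.nn as nn

PARAM_DIM = 2048   # length of a flattened weight vector (toy size)
LATENT_DIM = 64    # compressed latent size

class ParamAutoencoder(nn.Module):
    """Compress flattened network weights into a small latent space."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(PARAM_DIM, 256), nn.ReLU(),
                                 nn.Linear(256, LATENT_DIM))
        self.dec = nn.Sequential(nn.Linear(LATENT_DIM, 256), nn.ReLU(),
                                 nn.Linear(256, PARAM_DIM))

    def forward(self, w):
        z = self.enc(w)
        return self.dec(z), z

class LatentDenoiser(nn.Module):
    """Predict the noise that was added to a latent at noise level t."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(LATENT_DIM + 1, 256), nn.ReLU(),
                                 nn.Linear(256, LATENT_DIM))

    def forward(self, z_noisy, t):
        return self.net(torch.cat([z_noisy, t], dim=-1))

def diffusion_loss(ae, denoiser, checkpoints):
    """checkpoints: a batch of flattened weight vectors from ordinary SGD runs."""
    _, z = ae(checkpoints)                      # encode weights into latents
    t = torch.rand(z.shape[0], 1)               # random noise level in [0, 1)
    noise = torch.randn_like(z)
    z_noisy = (1 - t) * z + t * noise           # simple linear noising schedule
    pred = denoiser(z_noisy.detach(), t)
    return nn.functional.mse_loss(pred, noise)  # learn to predict the noise
```

Sampling would presumably then be noise, iterative denoising in latent space, decode back to a weight vector, which is where the "novel parameters" would come from.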
This doesn't seem all that impressive when you compare it to earlier work like 'g.pt' <a href="https://arxiv.org/abs/2209.12892" rel="nofollow">https://arxiv.org/abs/2209.12892</a> (Peebles et al., 2022). They cite it in passing, but offer no comparison or discussion, and to my eyes g.pt is a lot more interesting (for example, you can prompt it for a variety of network properties like low vs. high score, whereas this just generates unconditionally) and more thoroughly evaluated. The autoencoder here doesn't seem like it adds much.
Seems like we're getting very close to recursive self-improvement [0].<p>[0] <a href="https://www.lesswrong.com/tag/recursive-self-improvement" rel="nofollow">https://www.lesswrong.com/tag/recursive-self-improvement</a>
"We synthesize 100 novel parameters by feeding random noise into the latent diffusion model and the trained decoder." Cool that patterns exist at this level, but also, 100 params means we have a long way to go before this process is efficient enough to synthesize more modern-sized models.
fuck. I have an idea just like this one. I guess it's true that ideas are a dime a dozen.
Diffusion bears a remarkable similarity to backpropagation, to my eye. I've thought it could be used in place of backprop for some parts of a model.<p>Furthermore, I posit that residual connections, especially in transformers, allow the model a more exploratory behavior that is really powerful, and that this is a necessary component of the power of transformers (toy block sketched below). The transformer is just such a great architecture the more I think about it. It does so many things right. Though this is not really related to the topic.
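To pin down what I mean by residual connections in transformers: the skip paths around attention and the MLP, as in a standard pre-norm block. Toy sizes, plain PyTorch, nothing specific to the paper.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Standard pre-norm transformer block with residual (skip) connections."""
    def __init__(self, d=128, heads=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(d)
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d)
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]    # residual connection around attention
        x = x + self.mlp(self.norm2(x))  # residual connection around the MLP
        return x

x = torch.randn(2, 16, 128)              # (batch, tokens, dim)
print(Block()(x).shape)                   # torch.Size([2, 16, 128])
```

Each sublayer only has to learn a small update on top of the identity path, which is the property I'm gesturing at.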
Important to note: they say "From these generated models, we select the one with the best performance on the training set." There's definitely potential for selection bias here.
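A quick numerical illustration of the worry (made-up numbers, nothing to do with the paper's actual models): if you pick the best of 100 candidates on a noisy evaluation, the score you selected on overstates what an independent re-evaluation would show.

```python
import random

random.seed(0)
N, TRUE_ACC, NOISE = 100, 0.80, 0.02    # candidates, true accuracy, eval noise

gaps = []
for _ in range(1000):
    selection_scores = [TRUE_ACC + random.gauss(0, NOISE) for _ in range(N)]
    best = max(selection_scores)                 # score used to pick the "winner"
    fresh = TRUE_ACC + random.gauss(0, NOISE)    # independent re-evaluation
    gaps.append(best - fresh)

print(f"average optimism from best-of-{N} selection: {sum(gaps)/len(gaps):.3f}")
# comes out around +0.05, i.e. the selection score overstates the real accuracy
```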
Am I missing something, or is this just a case of "amortized inference", where you train a model (here a diffusion model) to infer something that was previously found via an optimization procedure (here, NN parameters)?
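By "amortized inference" I mean something like this toy sketch: instead of running an optimizer per problem instance, train a network to map the instance directly to the solution the optimizer would have found. Here a closed-form least-squares solve stands in for the expensive optimization; everything is invented for illustration.

```python
import torch
import torch.nn as nn

def solve_instance(X, y):
    # the "expensive" per-instance optimization, here a closed-form least-squares solve
    return torch.linalg.lstsq(X, y).solution.flatten()

# amortizer maps a whole problem instance (X, y) directly to the solution vector
amortizer = nn.Sequential(nn.Linear(5 * 3 + 5, 64), nn.ReLU(), nn.Linear(64, 3))
opt = torch.optim.Adam(amortizer.parameters(), lr=1e-3)

for step in range(2000):
    X = torch.randn(5, 3)                          # random problem instance
    y = X @ torch.randn(3, 1)
    target = solve_instance(X, y)                  # what optimization would have found
    pred = amortizer(torch.cat([X.flatten(), y.flatten()]))
    loss = nn.functional.mse_loss(pred, target)    # train to predict the optimum directly
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The parameter-diffusion setup reads like the same move with trained NN weights as the targets, which is what prompted the question.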
Training the state-of-the-art neural net architecture, whether that's transformers or the like, with self-play to optimize non-differentiable but highly efficient architectures is the way.
Hm, so does this actually improve/condense the representation for certain applications, or is it more of a global expand-and-collect in network space?