Just ran across this useful comparison with another very recent paper that effectively corroborates some of the core findings, I believe by an author of the other paper: https://www.lesswrong.com/posts/F4iogK5xdNd7jDNyw/comparing-anthropic-s-dictionary-learning-to-ours
Oh dang, I am quite literally working on this as a side project (out of mere curiosity).

Well, sort of: I'm refining an algorithm that takes several (carefully calibrated) outputs from a given LLM and infers the most plausible set of parameters behind them. I was expecting to find clusters of parameters very much like what they observe.

I informally call this problem "inverting" an LLM, and obviously it turns out to be non-trivial to solve. Not completely impossible, though! So far I've found some good approximations.

Anyway, quite an interesting read; I'll definitely keep an eye on what they publish in the future.

Also, from the linked manuscript at the end:

> Another hypothesis is that some features are actually higher-dimensional feature manifolds which dictionary learning is approximating.

Well, you have something that behaves like a continuous, smooth space, so you could define as many manifolds as you need to suit your purposes, so yes :^). But, pedantry off, I get the idea, and IMO that's definitely what's going on and the right framework to approach this problem from.

One amazing realization you can get from this: what is the conceptual equivalent of the transition functions that connect all the different manifolds in this LLM space? When you see it, your mind will be blown, not because of its complexity, but because of its exceptional simplicity.
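To make the idea concrete, here is a minimal toy sketch of the "inversion" setup I mean: fit unknown parameters so that the model's output distributions match the observed ones. The linear-softmax model, the random probes, and the optimizer settings are all illustrative stand-ins, not my actual algorithm.

```python
# Toy "inversion": recover parameters of a small linear-softmax model from its
# output distributions on a set of probe inputs.
import torch

torch.manual_seed(0)
d_in, d_out, n_probes = 8, 5, 200

W_true = torch.randn(d_out, d_in)             # parameters we pretend not to know

probes = torch.randn(n_probes, d_in)          # the "carefully calibrated" inputs
with torch.no_grad():
    observed = torch.softmax(probes @ W_true.T, dim=-1)   # the model's outputs

W_hat = torch.randn(d_out, d_in, requires_grad=True)      # candidate parameters
opt = torch.optim.Adam([W_hat], lr=0.05)

for _ in range(2000):
    pred = torch.log_softmax(probes @ W_hat.T, dim=-1)
    loss = torch.nn.functional.kl_div(pred, observed, reduction="batchmean")
    opt.zero_grad()
    loss.backward()
    opt.step()

print("final KL:", loss.item())
# Note: W_hat reproduces the outputs but is only identified up to the softmax's
# shift symmetry (adding the same vector to every row), which is one small
# taste of why the full problem is non-trivial.
```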
This looks like a big advance in alignment research. A big problem has been that LLMs were just a giant set of inscrutable numbers, and we had no idea what was going on inside.

But if this technique scales up, then Anthropic has fixed that. They can figure out what different groups of neurons are actually doing, and use that to control the LLM's behavior. That could help with preventing accidentally misaligned AIs.
This makes me wonder what would happen if neural networks contained manually programmed components. It seems like trivial components, such as detecting DNA sequences, could be programmed in by manually setting the weights. The same could be done, for example, to give neural networks a maths component. Would the network, during training, discover and make use of these predefined components, or would it ignore them and make up its own ways of detecting DNA sequences?
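For what it's worth, here is a minimal sketch (PyTorch; the TATA motif and all names are chosen purely for illustration) of what a hand-programmed, frozen component could look like. Whether a network trained around it would actually route through it is exactly the open question.

```python
# A frozen 1-D convolution over one-hot-encoded DNA whose weights are set by
# hand so that it fires only on an exact match of a fixed motif.
import torch
import torch.nn as nn

ALPHABET = "ACGT"
MOTIF = "TATA"

detector = nn.Conv1d(in_channels=4, out_channels=1, kernel_size=len(MOTIF), bias=True)
with torch.no_grad():
    detector.weight.zero_()
    for pos, base in enumerate(MOTIF):
        detector.weight[0, ALPHABET.index(base), pos] = 1.0
    detector.bias.fill_(-(len(MOTIF) - 0.5))   # output > 0 only on an exact match
detector.weight.requires_grad_(False)           # freeze: training cannot change it
detector.bias.requires_grad_(False)

def one_hot(seq: str) -> torch.Tensor:
    x = torch.zeros(1, 4, len(seq))
    for i, base in enumerate(seq):
        x[0, ALPHABET.index(base), i] = 1.0
    return x

print(torch.relu(detector(one_hot("GGTATACC"))))  # nonzero only where TATA occurs
```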
I am hoping that this type of research leads to ways of creating highly tuned and steerable models that are also much smaller and more efficient.

Because if you can see what each part is doing, then theoretically you can find ways to create just the set of features you want. Or maybe tune features that have redundant capacity or something.

Maybe by studying the features they will get to the point where the knowledge can be distilled into something more like a very rich and finely defined knowledge graph.
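For instance, if you already had a trained feature dictionary over a layer's activations, "creating just the set of features you want" might look roughly like the sketch below: clamp or zero individual feature activations and decode back. The weights here are random stand-ins, purely illustrative.

```python
# Steering via a (hypothetical, already trained) feature dictionary: edit the
# sparse feature activations, then decode back to an activation vector that
# could be patched into the model.
import torch

d_model, d_dict = 512, 4096
W_enc = torch.randn(d_dict, d_model) * 0.02   # stand-in for a trained encoder
W_dec = torch.randn(d_model, d_dict) * 0.02   # stand-in for a trained decoder

def steer(activation: torch.Tensor, feature_idx: int, scale: float) -> torch.Tensor:
    codes = torch.relu(activation @ W_enc.T)  # feature activations
    codes[..., feature_idx] *= scale          # 0.0 ablates, >1.0 amplifies
    return codes @ W_dec.T                    # edited activation

x = torch.randn(1, d_model)                   # stand-in for a layer activation
x_edited = steer(x, feature_idx=123, scale=0.0)   # suppress one feature
```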
One large model is not how the brain works. It's not how org charts work.

That LLMs are capable of what they are, at the compute density they run at, strongly signals to me that the task of making a productive knowledge worker is in overhang territory.

The missing piece isn't LLM advancement; it's LLM management: building trust in an inwardly adversarial LLM org chart that reports to you.
I'm just curious: how polysemantic is the human brain, neuron by neuron? It feels to me that what you really want, and what the human brain might have, is a high-information (feature-based / conceptual / macro-pattern-based) monosemantic neural network, and where there are polysemantic neurons, they share similar or the same information within the feature they are part of (leading to space efficiency, as well as computational efficiency). In transformer models like this, by contrast, it's as if you're superimposing a million human brains on top of the same network and then somehow averaging out all the features in the training set into unique neurons (naturally leading to a much larger "brain").

They also mention in the paper that monosemantic neurons in the network don't work well. My intuition is that this is because they are way too "high precision" and aren't encoding enough information at the feature level: features are, IMO, low-dimensional, so a monosemantic high-dimensional neuron would encode far too little information. But this is based on my limited knowledge of the human brain, so maybe there are more similarities than I'm aware of.
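To illustrate the superposition part of my question with a toy example (random directions, illustrative numbers only): in d dimensions you can pack far more than d features as nearly-orthogonal directions, which is exactly why individual neurons end up polysemantic.

```python
# Many more features than neurons, stored as nearly-orthogonal directions.
import numpy as np

rng = np.random.default_rng(0)
d, n_features = 128, 1024                     # 8x more features than dimensions

features = rng.normal(size=(n_features, d))
features /= np.linalg.norm(features, axis=1, keepdims=True)

overlaps = features @ features.T
np.fill_diagonal(overlaps, 0.0)
print("max |cos| between distinct features:", np.abs(overlaps).max())  # well below 1

# Each neuron (coordinate axis) carries non-negligible weight from many features:
per_neuron = (np.abs(features) > 0.1).sum(axis=0)
print("features per neuron with |weight| > 0.1 (mean):", per_neuron.mean())
```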
I am a layperson. As I understand it, a trained model describes transitions from one symbol to the next, with probabilities between nodes. There is a structure to this graph (after all, if there weren't, training would be impossible), but this structure is as if it were all written on one sheet of paper, with the definitions of each node all inked on top of each other in different colors.

This research (and its parent and sibling papers, from the LW article) seems to be about picking out those colored graph components from the floating-point soup?
Wait, embeddings have been used for classification for a long time now. Can somebody explain what is new here?

edit: ah, looked at the paper, they did it unsupervised, with a sparse autoencoder.
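For anyone else skimming, here is a minimal sketch of the general sparse-autoencoder recipe: learn an overcomplete dictionary of directions such that each activation vector is reconstructed from only a few of them (L1 penalty on the codes). This is the textbook version; the paper's exact architecture, losses, and scale differ.

```python
# Minimal sparse autoencoder over model activations (stand-in random data).
import torch
import torch.nn as nn

d_model, d_dict = 512, 4096                   # dictionary is overcomplete

class SparseAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, x):
        codes = torch.relu(self.encoder(x))   # sparse, non-negative feature activations
        return self.decoder(codes), codes

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3

# In the real setting, `activations` would be MLP activations collected from
# the language model; random data keeps the sketch self-contained.
activations = torch.randn(1024, d_model)

for _ in range(100):
    recon, codes = sae(activations)
    loss = ((recon - activations) ** 2).mean() + l1_coeff * codes.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```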
From a machine learning layman's point of view, but with some experience in modeling, it's hard to see this as a discovery. Model decomposition and model reduction techniques are very basic concepts in mathematical modeling, and decomposing a model into modes with high participation is a very basic technique, which boils down to finding linear combinations of basis vectors that are more expressive.

This is even less surprising given that LLMs are applied to models with a known hierarchical structure and symmetry.

Can anyone say exactly what's novel in these findings? From a layman's point of view, this sounds like announcing the invention of gunpowder.
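To be concrete about what I mean by decomposing into modes with high participation, here is the textbook version on a stand-in activation matrix (random data, names illustrative only): keep the directions with the largest singular values. My question is how far the linked work really goes beyond a sparse, overcomplete variant of this.

```python
# Classic mode decomposition: SVD of a (samples x neurons) activation matrix,
# keeping the modes with the highest participation (largest singular values).
import numpy as np

rng = np.random.default_rng(0)
acts = rng.normal(size=(10_000, 512))         # stand-in activation matrix

centered = acts - acts.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)

k = 64
top_modes = Vt[:k]                            # orthogonal basis, at most 512 directions
projected = centered @ top_modes.T            # coordinates in the reduced basis

print("variance captured by top", k, "modes:", (S[:k] ** 2).sum() / (S ** 2).sum())
```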
All machine learning is just renormalization, which in turn is a convolution in a Hopf algebra. That's why you see superposition.

"In physics, wherever there is a linear system with a 'superposition principle', a convolution operation makes an appearance."

I'm working this out in more detail, but it is uncanny how well it works out.

I have a Discord if you want to discuss this further: https://discord.cofunctional.ai
So, I came up with a pretty decent neural net from scratch about 20 years ago; it ran in the browser in Flash. It basically had a 10x10 bitmap input and an output of the same size, and lots of "neurons" in between that strengthened or weakened their connections based on feedback from the end result. And at a certain point they randomly mutated how they processed the input.

I don't see anything wildly different now, other than scale and youth and the hubris that accompanies those things.