This is exceptionally cool. Not only is it very interesting to see how this can be used to better understand and shape LLM behavior, but I can’t help thinking it’s also an interesting roadmap to human anthropology.<p>If we see LLMs as substantial compressed representations of human knowledge/thought/speech/expression—and within that, a representation of the world around us—then dictionary concepts that meaningfully explain this compressed representation should also share structure with human experience.<p>I don’t mean to take this canonically, it’s representations all the way down, but I can’t help but wonder what the geometry of this dictionary concept space says about us.
I find Anthropic's work on mech interp fascinating in general. Their initial Towards Monosemanticity paper was highly surprising, and so is this, with the ability to scale to a real production-scale LLM.<p>My observation is, and this may be more philosophical than technical: this process of "decomposing" middle-layer activations with a sparse autoencoder -- is it accurately capturing underlying features in the latent space of the network, or are we drawing order from chaos, imposing monosemanticity where there is none? Or to put it another way, were the features always there, learnt by training, or are we doing post-hoc rationalisation -- where the features exist because that's how we defined the autoencoders' dictionaries, and we learn only what we wanted to learn? Are the alien minds of LLMs truly operating on a semantic space similar to ours, or are we reading tea leaves and seeing what we want to see?<p>Maybe this distinction doesn't even make sense to begin with; concepts are made by man, and if clamping one of these features modifies outputs in a way that is understandable to humans, it doesn't matter whether it's capturing some kind of underlying cluster in the latent space of the model. But I do think it's an interesting idea to ponder.
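For anyone who hasn't looked at the mechanics: the decomposition step itself is conceptually simple. Below is a minimal sketch of a sparse autoencoder over middle-layer activations, with toy dimensions and random data rather than Anthropic's actual setup:

```python
# Minimal sketch of the sparse-autoencoder idea: decompose an activation
# vector into a sparse, non-negative combination of learned dictionary
# directions ("features"). Dimensions are toy values, not the paper's.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 512, n_features: int = 16384):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, activations: torch.Tensor):
        # ReLU keeps feature activations non-negative; an L1 penalty on them
        # during training is what pushes most of them to zero (sparsity).
        features = torch.relu(self.encoder(activations))
        reconstruction = self.decoder(features)
        return features, reconstruction

sae = SparseAutoencoder()
x = torch.randn(8, 512)                 # stand-in for middle-layer activations
features, x_hat = sae(x)
loss = (x - x_hat).pow(2).mean() + 1e-3 * features.abs().mean()
```

Whether the resulting dictionary reflects structure the model already had, or structure the sparsity penalty imposes, is exactly the question the comment above raises.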
It would be interesting to allow users of models to customize inference by tweaking these features, sort of like a semantic equalizer for LLMs. My guess is that this wouldn't work as well as fine-tuning, since that would tweak all the features at once toward your use case, but the equalizer would require zero training data.<p>The prompt itself can trigger the features, so if you say "Try to weave in mentions of San Francisco" the San Francisco feature will be more activated in the response. But having a global equalizer could reduce drift as the conversation continued, perhaps?
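Mechanically, such a "semantic equalizer" could be as simple as scaling individual feature activations between the encode and decode steps. A rough sketch, assuming you already have encoder/decoder weights from a trained sparse autoencoder (the weights, dimensions, and feature indices below are random stand-ins):

```python
# Hypothetical feature "equalizer": encode an activation into feature space,
# scale user-chosen features (e.g. a "San Francisco" feature), decode back.
# All weights and indices here are made-up stand-ins for illustration.
import torch

d_model, n_features = 512, 8192
W_enc = torch.randn(n_features, d_model) * 0.01   # stand-in encoder weights
W_dec = torch.randn(d_model, n_features) * 0.01   # stand-in decoder weights

def equalize(activation: torch.Tensor, gains: dict[int, float]) -> torch.Tensor:
    feats = torch.relu(W_enc @ activation)        # feature activations
    for idx, gain in gains.items():               # the user's "sliders"
        feats[idx] *= gain
    return W_dec @ feats                          # back into model space

x = torch.randn(d_model)
steered = equalize(x, {1234: 5.0, 678: 0.0})      # boost one feature, mute another
```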
Great work as usual.<p>I was pretty upset seeing the superalignment team dissolve at OpenAI, but as is typical for the AI space, the news of one day was quickly eclipsed by the next day's.<p>Anthropic are really killing it right now, and it's very refreshing to see their commitment to publishing novel findings.<p>I hope this finally serves as the nail in the coffin for the "it's just fancy autocomplete" and "it doesn't understand what it's saying, bro" rhetoric.
This reminds me of how people often communicate to avoid offending others. We tend to soften our opinions or suggestions with phrases like "What if you looked at it this way?" or "You know what I'd do in those situations." By doing this, we subtly dilute the exact emotion or truth we're trying to convey. If we modify our words enough, we might end up with a statement that's completely untruthful. This is similar to how AI models might behave when manipulated to emphasize certain features, leading to responses that are not entirely genuine.
My thoughts<p>- LLMs just got a whole set of buttons you can push. Potential for the LLM to push its own buttons?<p>- Read the paper and ctrl+f 'deplorable'. This shows once again how we are underestimating LLMs' ability to appear conscious. It can be really effective. Reminiscent of Dr. Ford in Westworld: 'you (robots) never look more human than when you are suffering,' or something like that. I might be hallucinating the dialogue, but I'm pretty sure something like that was said, and I think it's quite true.<p>- Intensely realistic roleplaying potential unlocked.<p>- Efficiency gains from directly amplifying certain features instead of spending context length on them.<p>Very powerful stuff. I am eagerly waiting for the day I can play with this myself. (Someone please make it a local feature.)
So, to summarize:<p>>Used "dictionary learning"<p>>Found abstract features<p>>Found similar/close features using distance<p>>Tried amplifying and suppressing features<p>Not trying to be snarky, but this sounds mundane in the ML/LLM world. Then again, significant advances have come from simple concepts. Would love to hear from someone who has been able to try this out.
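For what it's worth, the "similar/close features using distance" step really is that plain: cosine similarity between dictionary directions. A sketch with random stand-in weights:

```python
# Nearest-neighbour features by cosine similarity between decoder directions.
# The decoder matrix here is random; in practice it comes from a trained SAE.
import torch

decoder = torch.randn(512, 8192)                      # one column per feature
dirs = torch.nn.functional.normalize(decoder, dim=0)  # unit-norm directions
query = 1234                                          # hypothetical feature id
sims = dirs[:, query] @ dirs                          # similarity to every feature
nearest = sims.topk(6).indices[1:]                    # top 5 neighbours, skipping itself
```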
Reminds me of this paper from a couple of weeks ago that isolated the "refusal vector" for prompts that caused the model to decline to answer certain prompts:<p><a href="https://news.ycombinator.com/item?id=40242939">https://news.ycombinator.com/item?id=40242939</a><p>I love seeing the work here -- especially the way that they identified a vector specifically for bad code. I've been trying to explore the way that we can use adversarial training to increase the quality of code generated by our LLMs, and so using this technique to get countering examples of secure vs. insecure code (to bootstrap the training process) is really exciting.<p>Overall, fascinating stuff!!
Strategic timing for the release of this paper. As of last week OpenAI looks weak in their commitment to _AI Safety_, losing key members of their Super Alignment team.
Huge. The activation scan, which looks for which nodes change the most when prompted with the words "Golden Gate Bridge" and later an image of the same bridge, is eerily reminiscent of a brain scan under similar prompts...
I continue to be impressed by Anthropic’s work and their dual commitment to scaling and safety.<p>HN is often characterized by a very negative tone related to any of these developments, but I really do feel that Anthropic is trying to do a “race to the top” in terms of alignment, though it doesn’t seem like all the other major companies are doing enough to race with them.<p>Particularly frustrating on HN is the common syllogism of:
1. I believe anything that “thinks” must do X.
2. The LLM doesn’t do X.
3. Therefore, the LLM doesn’t think.<p>X is usually neither well justified as constitutive of thinking (it’s usually constitutive of human thinking, but not of thinking writ large), nor is it explained why it matters whether the label of “thinking” applies to the LLM if the capabilities remain the same.
> Many features are multilingual (responding to the same concept across languages) and multimodal (responding to the same concept in both text and images), as well as encompassing both abstract and concrete instantiations of the same idea (such as code with security vulnerabilities, and abstract discussion of security vulnerabilities).<p>This seems like it's trivially true; if you find two different features for a concept in two different languages, just combine them and now you have a "multilingual feature".<p>Or are all of these features the same "size"? They might be and I might've missed it.
I wonder how interpretability and training can interplay. Some examples:<p>Imagine taking Claude, tweaking weights relevant to X, and then fine-tuning it on knowledge related to X. It could result in more neurons being recruited to learn about X.<p>Imagine performing this during training to amplify or reduce the importance of certain topics. Train it on a vast corpus, but tune at various checkpoints so that the neural network's knowledge distribution skews toward the topics you care about. This could be a way to get more performance from MoE models.<p>I am not an expert. Just putting on my generalist hat here. Tell me I'm wrong, because I'd be fascinated to hear the reasons.
At the risk of anthropomorphizing too much, I can't help but see parallels between the "my physical form is the Golden Gate Bridge" screenshot and the <a href="https://en.wikipedia.org/wiki/God_helmet" rel="nofollow">https://en.wikipedia.org/wiki/God_helmet</a> in humans --- both cognitive distortions caused by targeted exogenous neural activation.
I recorded myself trying to read through and understand the high-level of this if anyone's interested in following along: <a href="https://maciej.gryka.net/papers-in-public/#scaling-monosemanticity" rel="nofollow">https://maciej.gryka.net/papers-in-public/#scaling-monoseman...</a>
I always assumed the way to map these models would be by ablation, the same way we map the animal brain.<p>Damage part X of the network and see what happens. If the subject loses the ability to do Y, then X is responsible for Y.<p>See <a href="https://en.wikipedia.org/wiki/Phineas_Gage" rel="nofollow">https://en.wikipedia.org/wiki/Phineas_Gage</a>
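That kind of lesioning is easy to sketch on a toy network. Everything below is a stand-in; a real version would hook into an LLM's transformer layers rather than a two-layer MLP:

```python
# Toy illustration of ablation: zero out ("damage") specific hidden units and
# compare outputs before and after the lesion.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 4))

def ablated_forward(x: torch.Tensor, dead_units: list[int]) -> torch.Tensor:
    h = torch.relu(model[0](x))        # hidden layer ("part X of the network")
    h[..., dead_units] = 0.0           # lesion the chosen units
    return model[2](h)

x = torch.randn(1, 16)
print(model(x))                         # intact behaviour
print(ablated_forward(x, [3, 17, 42]))  # behaviour with three units knocked out
```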
It's interesting that they used this to manipulate models. I wonder if "intentions" can be found and tuned. That would have massive potential for use and misuse. I could imagine a villain taking a model and amplifying "the evil" using a similar technique.
If anyone wants to team up and work on stuff like this (on toy models so we can run locally) please get in touch. (Email in profile)<p>I’m so fascinated by this stuff but I’m having trouble staying motivated in this short attention span world.
They are trying to figure out what they actually built.<p>I suspect the time is coming when there will always be an aligned search AI between you and the internet.
For anyone who has read the paper, have they provided code examples or enough detail to recreate this with, say, Llama 3?<p>While they're concerned with safety, I'm much more interested in this as a tool for controllability. Maybe we can finally get rid of the woke customer service tone, and get AI to be more eclectic and informative, and less watered down in its responses.
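The paper itself doesn't come with code, but the ingredients look reproducible on open models: capture residual-stream activations with a forward hook, then train a sparse autoencoder on them. A hedged sketch using Hugging Face transformers (the model name and layer index are illustrative, and the weights require access):

```python
# Sketch of collecting middle-layer activations from an open model to train a
# sparse autoencoder on them. Model name and layer index are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Meta-Llama-3-8B"      # assumes you have access to the weights
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16)

captured = []
def hook(module, inputs, output):
    # Decoder layers return a tuple; element 0 is the hidden states.
    captured.append(output[0].detach())

handle = model.model.layers[16].register_forward_hook(hook)  # a middle layer

with torch.no_grad():
    model(**tok("The Golden Gate Bridge", return_tensors="pt"))
handle.remove()

activations = torch.cat(captured, dim=1)  # training data for the autoencoder
```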
So they made a system by trying out thousands of combinations to find the one that gives the best result, but they don't understand what's actually going on inside.
>what the model is "thinking" before writing its response<p>An actual "thinking machine" would be constantly running computations on its accumulated experience in order to improve its future output and/or further compress its sensory history.<p>An LLM is doing exactly nothing while waiting for the next prompt.