It's stealing the last layer (the softmax head), not an arbitrary part, and it targets "production models whose APIs expose full logprobs, or a logit bias". Not all language model APIs have these features, so this characterizes which APIs can be targeted and which can't. These important details belong in the title or abstract rather than the vaguer "typical API access".
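For intuition on why logit-bias access is enough, here's a minimal numpy sketch (my own toy simulation, not the paper's tooling; `api_top_logprobs` is a hypothetical stand-in for a real endpoint that returns only top-k logprobs but accepts a logit bias). Biasing one token into the top-k and comparing against a reference token recovers its logit up to a shared constant, because both logprobs share the same normalizer:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = 1000
z = rng.normal(size=vocab)  # hidden "true" logits the API never reveals directly

def api_top_logprobs(logit_bias=None, k=5):
    """Hypothetical API: returns only the top-k logprobs,
    optionally after adding a caller-supplied logit bias."""
    biased = z.copy()
    if logit_bias:
        for tok, b in logit_bias.items():
            biased[tok] += b
    logprobs = biased - np.log(np.sum(np.exp(biased)))
    top = np.argsort(logprobs)[::-1][:k]
    return {int(t): float(logprobs[t]) for t in top}

baseline = api_top_logprobs()
ref = max(baseline, key=baseline.get)  # reference token that stays in the top-k

target = 123   # some token normally far outside the top-k
B = 100.0      # large bias forces it into the returned top-k

out = api_top_logprobs(logit_bias={target: B})
# The biased normalizer cancels in the difference:
recovered = (out[target] - B) - out[ref]
print("recovered z[target] - z[ref]:", recovered)
print("true      z[target] - z[ref]:", z[target] - z[ref])
```

Repeating this over the whole vocabulary yields full logit vectors (up to an additive constant per query), which is why logit-bias access is roughly as dangerous here as full-logprob access.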
I'm not too up on this, and quite a bit of it is going over my head, but am I right in thinking that this would be some form of reverse engineering as opposed to 'stealing'?
They don't disclose the embedding dimension for gpt-3.5, but comparing the Size and # Queries columns in Table 4, gpt-3.5-turbo presumably has an embedding dimension of roughly 20,000? Interesting...
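For anyone wondering where that kind of estimate comes from: the paper's core observation is that every full logit vector is W·g for the same final projection W, so the collected vectors live in an h-dimensional subspace of vocab-size space, and stacking slightly more than h of them exposes h through the singular values. A rough numpy simulation of that idea (made-up toy sizes, not the real gpt-3.5 numbers):

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, vocab, n_queries = 512, 4096, 600   # toy sizes; real models are much larger

W = rng.normal(size=(vocab, hidden))        # final projection ("softmax head")
G = rng.normal(size=(n_queries, hidden))    # hidden states from n different prompts
Q = G @ W.T + 1e-6 * rng.normal(size=(n_queries, vocab))  # observed logit vectors (+ noise)

s = np.linalg.svd(Q, compute_uv=False)
# Singular values fall off a cliff after index h; the largest
# consecutive ratio locates that cliff, i.e. the hidden dimension.
est = int(np.argmax(s[:-1] / s[1:])) + 1
print("estimated hidden dimension:", est)   # prints 512
```

The number of logit vectors needed is only slightly more than h, which is presumably why the Size vs. # Queries comparison is suggestive of the hidden dimension (though for logit-bias-only APIs each vector itself costs multiple queries).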
I am curious what additional attacks knowing the last layer of an LLM enables. E.g., you go from a black-box attack to some sort of white-box attack [1].

Does it help with adversarial prompt injection? What % of the network do you need to know to identify whether an item was included in the pretraining data with k% confidence?

I assume we will see more of these and possibly complex zero-days. Interesting if you can steal any non-trivial % of model weights from a production model for relatively little money (compared to pretraining cost).

[1] https://lilianweng.github.io/posts/2023-10-25-adv-attack-llm/
there is an LVE for this: https://lve-project.org/reliability/repetition/openai--gpt-35-turbo.html
The implications of this sentiment are disturbing.

It is considered an "attack" to probe at something to understand how it works in detail.

In other words, how basically all natural science is done.

What the fuck has this world turned into?