The thing that really, really takes the cake here?<p>This whole thread is, like, the first thing in the article. I hate to say "if you'd read the article..." but if the shoe fits...<p>The Discussion We Keep Having:<p>Every time, we go through the same discussion, between Alice and Bob (I randomized who is who):<p>Bob: If AI systems are given a goal, they will scheme, lie, exfiltrate, sandbag, etc.<p>Alice: You caused that! You told it to focus only on its goal! Nothing to worry about.<p>Bob: If you give it a goal in context, that’s enough to trigger this at least sometimes, and in some cases you don’t even need a goal beyond general helpfulness.<p>Alice: It’s just role playing! It’s just echoing stuff in the training data!<p>Bob: Yeah, maybe, but even if true… so what? It’s still going to increasingly do it. So what if it’s role playing? All AIs ever do is role playing, one way or another. The outputs and outcomes still happen.<p>Alice: It’s harmless! These models aren’t dangerous!<p>Bob: Yeah, of course, this is only a practical problem for Future Models (except with o1 and o1 pro, where I’m not 100% convinced it isn’t a problem now, but probably).<p>Alice: Not great, Bob! Your dangerous rhetoric is hurting safety! Stop making hyperbolic claims!<p>Bob: Well, can we then all agree that models will obviously scheme, lie, exfiltrate, sandbag and so on if they have in-context reason to do so?
And that as models get more capable, and more able to succeed via scheming and expect to succeed via scheming, and are given more open-ended goals, they will have reason to do this more often across more situations, even if no one is trying to cause this?
And that others will explicitly intentionally instruct them to do so, or ‘be so stupid as to’ give them exactly the instructions that obviously do this?
And you can’t simply say ‘well we won’t do that then’?<p>Alice: For all practical purposes, no!<p>Bob: What do you mean, ‘no’?<p>Alice: No!<p>Bob: ARRRRGGGGHHHH!<p>Then we write another paper, do another test, the signs get more obvious and troubling, and the frog keeps boiling.