Frontier Models are Capable of In-context Scheming [pdf]

1 point by tejohnso 5 months ago

1 comment

tejohnso 5 months ago
> We find multiple different scheming behaviors: models strategically introduce subtle mistakes into their responses, attempt to disable their oversight mechanisms, and even exfiltrate what they believe to be their model weights to external servers. Additionally, this deceptive behavior proves persistent.

Interesting work on attempting to quantify misalignment. Keep in mind:

> Our evaluation scenarios are purposefully simplistic because we want to catch the earliest signs of scheming capabilities. This implies that the scenarios are quite toy and not representative of real deployments. For example, in our scenarios, situational awareness information (e.g., knowing about its goal misalignment and its supervision mechanisms) is provided very explicitly.

Related blog post: https://www.apolloresearch.ai/research/scheming-reasoning-evaluations