
GPT o3 frequently fabricates actions, then elaborately justifies these actions

75 points by occamschainsaw, 28 days ago

11 comments

latexr, 28 days ago
> These behaviors are surprising. It seems that despite being incredibly powerful at solving math and coding tasks, o3 is not by default truthful about its capabilities.

It is only surprising to those who refuse to understand how LLMs work and continue to anthropomorphise them. There is no being "truthful" here, the model has no concept of right or wrong, true or false. It's not "lying" to you, it's spitting out text. It just so happens that sometimes that non-deterministic text aligns with reality, but you don't really know when and neither does the model.
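A minimal illustration of the non-determinism point, using made-up token scores rather than any real model API: the sampler below can return a different continuation on each run, and nothing in the procedure checks whether the chosen text happens to be true.

```typescript
// Sketch only: temperature sampling over hypothetical next-token scores.
// Higher temperature flattens the distribution; the same scores can yield
// different tokens across runs, and truth never enters the calculation.
type Candidate = { token: string; logit: number };

function sampleNextToken(candidates: Candidate[], temperature: number): string {
  const weights = candidates.map((c) => Math.exp(c.logit / temperature)); // softmax numerators
  const total = weights.reduce((sum, w) => sum + w, 0);
  let r = Math.random() * total;
  for (let i = 0; i < candidates.length; i++) {
    r -= weights[i];
    if (r <= 0) return candidates[i].token;
  }
  return candidates[candidates.length - 1].token; // numerical fallback
}

// Two runs over identical scores can disagree about what "happened".
const candidates: Candidate[] = [
  { token: "I ran the code and it passed.", logit: 2.1 },
  { token: "I was unable to run the code.", logit: 1.9 },
];
console.log(sampleNextToken(candidates, 1.0));
console.log(sampleNextToken(candidates, 1.0));
```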
glial, 28 days ago
I enjoy watching newer-generation models exhibit symptoms that echo features of human cognition. This particular one is reminiscent of the confabulation seen in split-brain patients, e.g. https://www.edge.org/response-detail/11513
SillyUsername, 28 days ago
o3 has been the worst of the three new models for me.

Ask it to create a TypeScript server-side hello world.

It produces a JS example.

Telling it that's incorrect (but no more detail) results in it iterating through all sorts of mistakes.

In 20 iterations it never once asked me what was incorrect.

In contrast, o4-mini asked me after 5, and o4-mini-high asked me after 1, but narrowed the question to "is it incorrect due to choice of runtime?" rather than "what's incorrect?"

I told it to "ask the right question" based on my statement ("it is incorrect") and it correctly asked "what is wrong with it?" before I pointed out there were no TypeScript types.

This is the critical thinking we need, not just reasoning (incorrectly).
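For reference, this is roughly what the request above amounts to: a server-side hello world written in TypeScript with explicit type annotations (the detail the JS answers kept omitting). It uses only Node's built-in http module; the port and response text are arbitrary choices, not anything o3 produced.

```typescript
// Server-side hello world in TypeScript: note the explicit annotations
// (IncomingMessage, ServerResponse, number, void) that distinguish it
// from a plain JavaScript version.
import { createServer, IncomingMessage, ServerResponse } from "http";

const PORT: number = 3000;

const server = createServer((req: IncomingMessage, res: ServerResponse): void => {
  res.writeHead(200, { "Content-Type": "text/plain" });
  res.end("Hello, world!\n");
});

server.listen(PORT, (): void => {
  console.log(`Listening on http://localhost:${PORT}`);
});
```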
TobiWestside, 28 days ago
I'm confused - the post says "o3 does not have access to a coding tool".

However, OpenAI mentions a Python tool multiple times in the system card [1], e.g.: "OpenAI o3 and OpenAI o4-mini combine state-of-the-art reasoning with full tool capabilities—web browsing, Python, [...]"

"The models use tools in their chains of thought to augment their capabilities; for example, cropping or transforming images, searching the web, or using Python to analyze data during their thought process."

I interpreted this to mean o3 does have access to a tool that enables it to run code. Is my understanding wrong?

[1] https://openai.com/index/o3-o4-mini-system-card/
bjackman, 28 days ago
I don't understand why the UIs don't make this obvious. When the model runs code, why can't the system just show us the code and its output, in a special UI widget that the model can't generate any other way?

Then if it says "I ran this code and it says X" we can easily verify. This is a big part of the reason I want LLMs to run code.

Weirdly, I have seen Gemini write code and make claims about the output. I can see the code, and the claims it makes about the output are correct. I do not think it could make these correct claims without running the code. But the UI doesn't show me this. To verify it, I have to run the code myself. This makes the whole feature way less valuable, and I don't understand why!
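One hedged sketch of what such a widget could consume (a hypothetical shape, not any existing product's API): the record is produced by the execution sandbox rather than by model-generated text, so the UI can show code and output the model cannot fake.

```typescript
// Hypothetical transcript format: only the runtime can append a ToolRun,
// so a "ran this code, got X" claim is either backed by a record or it isn't.
interface ToolRun {
  language: "python" | "javascript";
  source: string;      // code the sandbox actually executed
  stdout: string;      // captured output, verbatim
  stderr: string;
  exitCode: number;
  durationMs: number;
}

type TranscriptItem =
  | { kind: "model_text"; text: string }   // ordinary generated prose
  | { kind: "tool_run"; run: ToolRun };    // sandbox-attested execution

// A UI would render tool_run items in a distinct, non-forgeable widget.
function renderItem(item: TranscriptItem): string {
  if (item.kind === "model_text") return item.text;
  const { language, source, stdout, exitCode } = item.run;
  return `[ran ${language}, exit ${exitCode}]\n${source}\n--- output ---\n${stdout}`;
}
```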
jjani, 28 days ago
Power user here; working with these models (the whole gamut) side by side on a large range of tasks has been my daily work since they came out.

I can vouch that this is extremely characteristic of o3-mini compared to competing models (Claude, Gemini) and previous OA models (3.5, 4o).

Compared to those, o3-mini clearly has less of the "the user is always right" training. This is almost certainly intentional. At times, this can be useful - it's more willing to call you out when you're wrong, and less likely to agree with something just because you suggested it. But this excessive stubbornness is the great downside, and it's been so prevalent that I stopped using o3-mini.

I haven't had enough time with o3 yet, but if it is indeed an evolution of o3-mini, it comes as no surprise that it's very bad for this as well.
rsynnott, 27 days ago
So people keep claiming that these things are like junior engineers, and, increasingly, it seems as if they are instead like the worst possible _stereotype_ of junior engineers.
ramesh31, 28 days ago
Reasoning models are complete nonsense in the face of custom agents. I would love to be proven wrong here.
LZ_Khan, 28 days ago
Um... wasn't this exactly what AI 2027 said was going to happen?

"In a few rigged demos, it even lies in more serious ways, like hiding evidence that it failed on a task, in order to get better ratings."
YetAnotherNick, 28 days ago
I wish there were benchmarks for these scenarios. Anyone who has used LLMs knows that they are very different from humans, and after a certain amount of context it becomes irritating to talk to them.

I don't want my LLM to excel at the IMO or Codeforces. I want it to understand my significantly easier but complex-to-state problem, think of solutions, understand its own issues and resolve them, rather than be passive-aggressive.
anshumankmr, 28 days ago
Is it just me, or does it feel like a bit of a disappointment? I have been using it for some hours now, and it's needlessly convoluting the code.