
GPT o3 frequently fabricates actions, then elaborately justifies these actions

75 points by occamschainsaw, 27 days ago

11 comments

latexr, 27 days ago
> These behaviors are surprising. It seems that despite being incredibly powerful at solving math and coding tasks, o3 is not by default truthful about its capabilities.

It is only surprising to those who refuse to understand how LLMs work and continue to anthropomorphise them. There is no being "truthful" here; the model has no concept of right or wrong, true or false. It's not "lying" to you, it's spitting out text. It just so happens that *sometimes* that non-deterministic text aligns with reality, but you don't really know when, and neither does the model.
glial, 27 days ago
I enjoy watching newer-generation models exhibit symptoms that echo features of human cognition. This particular one is reminiscent of the confabulation seen in split-brain patients, e.g. https://www.edge.org/response-detail/11513
SillyUsername, 27 days ago
o3 has been the worst model of the new three for me.

Ask it to create a TypeScript server-side hello world.

It produces a JS example.

Telling it that's incorrect (but no more detail) results in it iterating through all sorts of mistakes.

In 20 iterations it never once asked me what was incorrect.

In contrast, o4-mini asked me after 5, and o4-mini-high asked me after 1, but it narrowed the question to "is it incorrect due to choice of runtime?" rather than "what's incorrect?"

I told it to "ask the right question" based on my statement ("it is incorrect"), and it correctly asked "what is wrong with it?" before I pointed out the missing TypeScript types.

This is the critical thinking we need, not just reasoning (incorrectly).
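For reference, a minimal sketch of what a correctly typed answer might look like - assuming Node.js and its built-in http module, since the comment doesn't say which runtime was asked for; the explicit type annotations are what distinguish it from the plain-JS answer o3 kept producing:

    // Sketch only: assumes Node.js with its built-in "http" module.
    // The type annotations (IncomingMessage, ServerResponse, number) are what
    // make this TypeScript rather than plain JavaScript.
    import { createServer, IncomingMessage, ServerResponse } from "http";

    const server = createServer((req: IncomingMessage, res: ServerResponse): void => {
      res.writeHead(200, { "Content-Type": "text/plain" });
      res.end("Hello, world!\n");
    });

    const port: number = 3000;
    server.listen(port, () => {
      console.log(`Listening on http://localhost:${port}`);
    });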
TobiWestside, 27 days ago
I'm confused - the post says "o3 does not have access to a coding tool".

However, OpenAI mentions a Python tool multiple times in the system card [1], e.g.: "OpenAI o3 and OpenAI o4-mini combine state-of-the-art reasoning with full tool capabilities—web browsing, Python, [...]"

"The models use tools in their chains of thought to augment their capabilities; for example, cropping or transforming images, searching the web, or using Python to analyze data during their thought process."

I interpreted this to mean o3 *does* have access to a tool that enables it to run code. Is my understanding wrong?

[1] https://openai.com/index/o3-o4-mini-system-card/
bjackman, 27 days ago
I don't understand why the UIs don't make this obvious. When the model runs code, why can't the system just show us the code and its output, in a special UI widget that the model can't generate any other way?

Then, if it says "I ran this code and it says X", we can easily verify it. This is a big part of the reason I want LLMs to run code.

Weirdly, I have seen Gemini write code and make claims about the output. I can see the code, and the claims it makes about the output are correct. I do not think it could make these correct claims without running the code. But the UI doesn't show me this. To verify it, I have to run the code myself. That makes the whole feature far less valuable, and I don't understand why!
jjani, 27 days ago
Power user here - working with these models (the whole gamut) side by side on a large range of tasks has been my daily work since they came out.

I can vouch that this is extremely characteristic of o3-mini compared to competing models (Claude, Gemini) and previous OpenAI models (3.5, 4o).

Compared to those, o3-mini clearly has less of the "the user is always right" training. This is almost certainly intentional. At times this can be useful: it's more willing to call you out when you're wrong, and less likely to agree with something just because you suggested it. But this excessive stubbornness is the great downside, and it's been so prevalent that I stopped using o3-mini.

I haven't had enough time with o3 yet, but if it is indeed an evolution of o3-mini, it comes as no surprise that it's very bad at this as well.
rsynnott, 27 days ago
So people keep claiming that these things are like junior engineers, and, increasingly, it seems as if they are instead like the worst possible _stereotype_ of junior engineers.
ramesh31, 27 days ago
Reasoning models are complete nonsense in the face of custom agents. I would love to be proven wrong here.
LZ_Khan, 27 days ago
Um... wasn't this exactly what AI 2027 said was going to happen?

"In a few rigged demos, it even lies in more serious ways, like hiding evidence that it failed on a task, in order to get better ratings."
YetAnotherNick, 27 days ago
I wish there were benchmarks for these scenarios. Anyone who has used LLMs knows that they are very different from humans, and after a certain amount of context it becomes irritating to talk to them.

I don't want my LLM to excel at the IMO or Codeforces. I want it to understand my significantly easier but complex-to-state problem, think of solutions, understand its own issues and resolve them, rather than be passive-aggressive.
anshumankmr, 27 days ago
Is it just me, or does it feel like a bit of a disappointment? I have been using it for a few hours now, and it's needlessly convoluting the code.