> These behaviors are surprising. It seems that despite being incredibly powerful at solving math and coding tasks, o3 is not by default truthful about its capabilities.

It is only surprising to those who refuse to understand how LLMs work and continue to anthropomorphise them. There is no being "truthful" here: the model has no concept of right or wrong, true or false. It's not "lying" to you, it's spitting out text. It just so happens that *sometimes* that non-deterministic text aligns with reality, but you don't really know when, and neither does the model.
I enjoy watching newer-generation models exhibit symptoms that echo features of human cognition. This particular one is reminiscent of the confabulation seen in split-brain patients, e.g. https://www.edge.org/response-detail/11513
o3 has been the worst of the three new models for me.

Ask it to create a TypeScript server-side hello world.

It produces a JS example.

Telling it that's incorrect (but giving no more detail) results in it iterating through all sorts of mistakes.

In 20 iterations it never once asked me what was incorrect.

In contrast, o4-mini asked me after 5, and o4-mini-high asked me after 1, though it narrowed the question to "is it incorrect due to choice of runtime?" rather than "what's incorrect?"

I told it to "ask the right question" based on my statement ("it is incorrect") and it correctly asked "what is wrong with it?" before I pointed out there were no TypeScript types.

This is the critical thinking we need, not just reasoning (incorrectly).
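For reference, something like the following is what I had in mind: a minimal sketch assuming the Node runtime and its built-in http module (I never specified a runtime, so that part is my assumption), with explicit type annotations rather than plain JS in a .ts file.

```typescript
import { createServer, IncomingMessage, ServerResponse } from "http";

const port: number = 3000;

// The explicit parameter and return types are the point: this is TypeScript,
// not untyped JavaScript that happens to live in a .ts file.
const server = createServer((req: IncomingMessage, res: ServerResponse): void => {
  res.writeHead(200, { "Content-Type": "text/plain" });
  res.end("Hello, world!\n");
});

server.listen(port, (): void => {
  console.log(`Listening on http://localhost:${port}`);
});
```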
I'm confused: the post says "o3 does not have access to a coding tool".

However, OpenAI mentions a Python tool multiple times in the system card [1], e.g.:

"OpenAI o3 and OpenAI o4-mini combine state-of-the-art reasoning with full tool capabilities—web browsing, Python, [...]"

"The models use tools in their chains of thought to augment their capabilities; for example, cropping or transforming images, searching the web, or using Python to analyze data during their thought process."

I interpreted this to mean o3 *does* have access to a tool that enables it to run code. Is my understanding wrong?

[1] https://openai.com/index/o3-o4-mini-system-card/
I don't understand why the UIs don't make this obvious. When the model runs code, why can't the system just show us the code and its output, in a special UI widget that the model can't generate any other way?

Then if it says "I ran this code and it says X" we can easily verify it. This is a big part of the reason I want LLMs to run code.

Weirdly, I have seen Gemini write code and make claims about the output. I can see the code, and the claims it makes about the output are correct. I don't think it could make those correct claims without running the code, but the UI doesn't show me this. To verify it, I have to run the code myself. That makes the whole feature far less valuable, and I don't understand why!
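Something along these lines is all I'm asking for. This is a purely hypothetical sketch; the event shape is my own invention, not any vendor's actual API. The key property is that the tool-run block is built only from the execution record, never from model-generated text.

```typescript
// Hypothetical event types a chat client might receive from its backend.
type ChatEvent =
  | { kind: "assistant_text"; text: string }
  | { kind: "tool_run"; language: string; code: string; stdout: string; exitCode: number };

function renderEvent(ev: ChatEvent): string {
  if (ev.kind === "assistant_text") {
    return ev.text; // ordinary model output, rendered as prose
  }
  // Tool runs get a distinct, non-forgeable widget: the code that was actually
  // executed and the output it actually produced, shown verbatim.
  return [
    `── code executed (${ev.language}) ──`,
    ev.code,
    `── output (exit ${ev.exitCode}) ──`,
    ev.stdout,
  ].join("\n");
}

// A claim like "I ran this and it printed 1024" can now be checked against stdout.
console.log(
  renderEvent({
    kind: "tool_run",
    language: "python",
    code: "print(2 ** 10)",
    stdout: "1024\n",
    exitCode: 0,
  })
);
```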
Power user here; working with these models (the whole gamut) side-by-side on a large range of tasks has been my daily work since they came out.

I can vouch that this is extremely characteristic of o3-mini compared to competing models (Claude, Gemini) and previous OpenAI models (3.5, 4o).

Compared to those, o3-mini clearly has less of the "the user is always right" training. This is almost certainly intentional. At times it can be useful: it's more willing to call you out when you're wrong, and less likely to agree with something just because you suggested it. But this excessive stubbornness is the great downside, and it's been so prevalent that I stopped using o3-mini.

I haven't had enough time with o3 yet, but if it is indeed an evolution of o3-mini, it comes as no surprise that it's very bad in this respect as well.
So people keep claiming that these things are like junior engineers, and, increasingly, it seems as if they are instead like the worst possible _stereotype_ of junior engineers.
Um... isn't this exactly what AI 2027 said would happen?

"In a few rigged demos, it even lies in more serious ways, like hiding evidence that it failed on a task, in order to get better ratings."
I wish there were benchmarks for these scenarios. Anyone who has used LLMs knows that they are very different from humans, and after a certain amount of context it becomes irritating to talk to them.

I don't want my LLM to excel at the IMO or on Codeforces. I want it to understand my significantly easier but complex-to-state problem, think of solutions, understand its own mistakes and fix them, rather than be passive-aggressive.