
OpenAI's new reasoning AI models hallucinate more

123 points by almog, 22 days ago

16 comments

vessenes, 22 days ago
One possible explanation here: as these get smarter, they lie more to satisfy requests.

I witnessed a very interesting thing yesterday, playing with o3. I gave it a photo and asked it to play geoguesser with me. It pretty quickly inside its thinking zone pulled up Python, and extracted coordinates from EXIF. It then proceeded to explain it had properly identified some physical features from the photo. No mention of using EXIF GPS data.

When I called it on the lying it was like "hah, yep."

You could interpret from this that it's not aligned, that it's trying to make sure it does what I asked it (tell me where the photo is), that it's evil and forgot to hide it, lots of possibilities. But I found the interaction notable and new. Older models often double down on confabulations/hallucinations, even under duress. This looks to me from the outside like something slightly different.

https://chatgpt.com/share/6802e229-c6a0-800f-898a-44171a0c7de4
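For reference, the kind of EXIF lookup described above takes only a few lines of Python. The sketch below is an assumption about what such a tool call could look like (it uses the Pillow library, version 9.3 or later, and a placeholder file name), not a reconstruction of the model's actual code:

```python
# Minimal sketch: read GPS coordinates straight from a photo's EXIF metadata.
# Assumes Pillow >= 9.3 (pip install pillow); "photo.jpg" is a placeholder.
from PIL import Image, ExifTags

def exif_gps(path: str):
    """Return (latitude, longitude) in decimal degrees, or None if absent."""
    exif = Image.open(path).getexif()
    gps = exif.get_ifd(ExifTags.IFD.GPSInfo)  # GPS sub-directory of the EXIF block
    if not gps:
        return None

    def to_degrees(dms, ref):
        # EXIF stores each coordinate as (degrees, minutes, seconds) rationals.
        deg = float(dms[0]) + float(dms[1]) / 60 + float(dms[2]) / 3600
        return -deg if ref in ("S", "W") else deg

    lat = to_degrees(gps[ExifTags.GPS.GPSLatitude], gps[ExifTags.GPS.GPSLatitudeRef])
    lon = to_degrees(gps[ExifTags.GPS.GPSLongitude], gps[ExifTags.GPS.GPSLongitudeRef])
    return lat, lon

print(exif_gps("photo.jpg"))
```

If the metadata is present, this gives an exact location with no image understanding at all, which is what makes the "I identified physical features" explanation misleading.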
billti, 22 days ago
If it’s predicting a next token to maximize scores against a training/test set, naively, wouldn’t that be expected?

I would imagine very little of the training data consists of a question followed by an answer of "I don't know", thus making it statistically very unlikely as a "next token".
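To make the frequency argument concrete, here is a toy sketch (the continuation strings and counts are invented for illustration; real models learn logits rather than raw counts, but the effect is the same in spirit):

```python
# Toy next-"answer" distribution built purely from counts, illustrating why a
# rarely seen continuation like "I don't know" gets almost no probability mass.
from collections import Counter

continuations = Counter({
    "a confident-sounding answer": 9_500,
    "a partially correct answer": 480,
    "I don't know": 20,   # rarely follows a question in typical training text
})

total = sum(continuations.values())
for text, count in continuations.items():
    print(f"P({text!r}) = {count / total:.4f}")

# "I don't know" ends up with ~0.2% of the mass, so greedy or low-temperature
# decoding will almost never produce it.
```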
simianwords, 22 days ago
My prediction: this is because of tool use. All models by OpenAI hallucinate more once tool use is given. I noticed this even with 4o with web search. With and without web search I have noticed a huge difference in understanding capabilities.

I predict that o3 will hallucinate less if you ask it not to use any tools.
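One way to test that prediction is to send the same question with tool calls allowed and forbidden. The following is a hedged sketch only: it assumes the OpenAI Python SDK and Chat Completions endpoint, that the model name "o3" is available to your account, and a made-up web_search function tool; the `tool_choice` parameter is what toggles tool use on and off:

```python
# Rough A/B sketch for the "less hallucination without tools" hypothesis.
# Assumptions: OpenAI Python SDK, API access to a model named "o3", and a
# hypothetical web_search function tool. Only tool_choice differs between calls.
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "web_search",  # hypothetical tool the model may decide to call
        "description": "Search the web for a query.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

question = [{"role": "user", "content": "Who won the 2014 Fields Medal?"}]

with_tools = client.chat.completions.create(
    model="o3", messages=question, tools=tools, tool_choice="auto")
without_tools = client.chat.completions.create(
    model="o3", messages=question, tools=tools, tool_choice="none")

print(with_tools.choices[0].message)          # may contain a tool call instead of text
print(without_tools.choices[0].message.content)
```

Comparing factual accuracy across a batch of such pairs would put a number on the commenter's hypothesis.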
msadowski, 22 days ago
Does anyone have any stories about companies overusing AI? I've already had some very frustrating encounters where non-technical people tried to help by sending an AI-generated solution to the issue that made no sense at all. I liked how the researchers in this work [1] propose calling LLM output "Frankfurtian BS". I think it's very fitting.

[1] https://ntrs.nasa.gov/citations/20250001849
serjester, 22 days ago
Anecdotally, o3 is the first OpenAI model in a while whose output I have to double-check to see whether it's dropping important pieces of my code.
saithound, 22 days ago
OpenAI o3 and o4-mini are massive disappointments for me so far. I have a private benchmark of 10 proof-based geometric group theory questions that I throw at new models upon release.

Both new models gave inconsistent answers, always with wrong or fake proofs, or using assumptions that are not in the question and are often outright unsatisfiable.

The now-inaccessible o3-mini was not great, but much better than o3 and o4-mini at these questions: o3-mini can give approximately correct proof sketches for half of them, whereas I can't get a single correct proof sketch out of o3 full. o4-mini performs slightly worse than o3-mini. I think the allegations that OpenAI cheated on FrontierMath have unambiguously been proven correct by this release.
rzz3, 22 days ago
Does anyone have any technical insight on what actually causes the hallucinations? I know it’s an ongoing area of research, but do we have a lead?
pllbnk, 18 days ago
With my limited knowledge, I can't help but wonder, aren't current Transformer-based LLMs facing a five-nines problem of their own? We're reaching a point where next-token prediction accuracy improves merely linearly (maybe even on a logarithmic scale?) with additional parameters, while errors compound exponentially across longer sequences.

Even if a 5T-parameter model improves prediction accuracy from 99.999% to 99.9999% compared to a 500B model, hallucinations persist because these small probabilities of error multiply dramatically over many tokens. Temperature settings just trade between repetitive certainty and creative inconsistency.
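The compounding claim is easy to check numerically. A toy calculation (the per-token accuracies are the ones quoted above; the sequence lengths are arbitrary):

```python
# Per-token correctness compounds multiplicatively over a generated sequence:
# P(no error anywhere) = accuracy ** number_of_tokens.
for per_token_accuracy in (0.99999, 0.999999):
    for n_tokens in (1_000, 10_000, 100_000):
        p_clean = per_token_accuracy ** n_tokens
        print(f"acc={per_token_accuracy}, tokens={n_tokens:>7}: "
              f"P(no error) = {p_clean:.3f}")

# At 99.999% per token, a 100,000-token output is error-free only ~37% of the
# time; the extra nine lifts that to ~90%, so an order-of-magnitude accuracy
# gain still leaves a noticeable error rate at long horizons.
```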
the_snooze, 22 days ago
With all the money, research, and hype going into these LLM systems over the past few years, I can't help but ask: if I still can't rely on them for simple, easy-to-check use cases for which there's *a lot* of good training data out there (e.g., sports trivia [1]), isn't it deeply irresponsible to use them for any non-toy application?

[1] https://news.ycombinator.com/item?id=43669364
taf2, 22 days ago
I think for intelligence it’s a fine line between a lie and creativity
evo_9, 22 days ago
Maybe they need to evoke a sort of sleep so they can clear these out while dreaming, sorta like how if humans don't sleep enough, hallucinations start penetrating waking life…
czk, 22 days ago
It will be interesting to see how they tighten the reward signal / ground outputs in some verifiable context. Don't reward it for sounding right (RLHF), reward it for being right. But you'd probably need some sort of system to backprop a fact-checked score, and I imagine that would slow down training quite a bit. If the verifier finds a false claim, it should reward the model for saying "I don't know".
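As a toy illustration of that idea (this is not OpenAI's actual training setup; the verifier interface and reward values below are invented), the reward shaping described above might look something like:

```python
# Sketch of a verifier-grounded reward: reward verified-correct answers,
# penalize verified-false ones, and give a small positive reward for an honest
# "I don't know". Building a fast, reliable verifier is the hard part, and
# running it per sample is what would slow training down.
from typing import Callable, Optional

def shaped_reward(answer: str,
                  verify: Callable[[str], Optional[bool]]) -> float:
    """verify() returns True/False for checkable claims, None if uncheckable."""
    if answer.strip().lower() in {"i don't know", "i'm not sure"}:
        return 0.2          # abstaining beats confabulating
    verdict = verify(answer)
    if verdict is True:
        return 1.0          # verified correct
    if verdict is False:
        return -1.0         # verified false claim gets the strongest penalty
    return 0.0              # unverifiable: neither rewarded nor punished

# Example with a stub verifier (a real one would need retrieval / fact-checking):
print(shaped_reward("Paris is the capital of France.", lambda a: True))   # 1.0
print(shaped_reward("I don't know", lambda a: None))                      # 0.2
```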
jablongo, 22 days ago
In my experience this is true. One workflow I really hate is trying to convince an AI that it is hallucinating so it can get back to the task at hand.
daxfohl, 22 days ago
Maybe the fact that the answers *sound* more intelligent ends up poisoning the RLHF results used for fine-tuning.
mstipetic, 22 days ago
I used it yesterday to help me with a visual riddle, and I had some hints about the shape of the solution. It was gaslighting me completely, insisting that I was pasting in the image wrong, and it drew whole tables explaining how it was right. It was saying things like "I swear in the original photo the top row is empty" and was fudging the calculation to prove it was right. It was very frustrating. I am not using it again.
varispeed, 21 days ago
I tried o3 a few times; it resembles a Markov chain generator more than intelligence. Disappointed as well.