
PhD Knowledge Not Required: A Reasoning Challenge for Large Language Models

174 points by enum, 3 months ago

14 comments

XCabbage, 3 months ago
This doesn't feel like a "reasoning" challenge. The mental skill required to solve most of these seems to be the ability to loop over all known members of a category like "popular brand names" or "well-known actors" and see if they fit the clue.

As a human, you'd expect to fail either because you didn't know a category member (e.g. as a non-American I have no idea WTF "Citgo" is; I could never get the answer to the first question because I have never seen that name before in my life) or because you weren't able to bring it to mind; the mental act of looping over all members of a category is quite challenging for a human.

Admittedly this is something an AI system could in principle be REALLY good at, and it's interesting to test and see that current ones are not! But it seems weird to me to call what's being tested "reasoning" when it's so heavily focused on memory recall (and evaluating whether a candidate answer works or not is trivial once you've brought it to mind and doesn't really require any intelligent thought).

(If the questions were multiple-choice, eliminating the challenge of bringing candidate answers to mind, which is the main challenge for a human, then I'd agree it was a "reasoning" test.)
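
The loop described above is literal generate-and-test; a minimal Python sketch (the candidate list and clue are placeholders of my own, not items from the benchmark):

    # Generate-and-test: enumerate known members of a category and keep
    # the ones that satisfy the clue. Placeholder data, not from the paper.
    candidates = ["Citgo", "Exxon", "Shell", "Chevron"]  # "popular brand names"

    def fits_clue(name: str) -> bool:
        """Placeholder clue: five letters ending in 'o'."""
        return len(name) == 5 and name.lower().endswith("o")

    print([n for n in candidates if fits_clue(n)])  # -> ['Citgo']

The check is trivial; as the comment says, the hard part is enumerating the candidates at all.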
rahidz, 3 months ago
What is so interesting to me is that the reasoning traces for these often have the correct answer, but the model fails to realize it.

Problem 3 ("Dry Eye"), R1: "Wait, maybe "cubitus valgus" – no, too long. Wait, three letters each. Let me think again. Maybe "hay fever" is two words but not three letters each. Maybe "dry eye"? "Dry" and "eye" – both three letters. "Dry eye" is a condition. Do they rhyme? "Dry" (d-rye) and "eye" (i) – no, they don't rhyme. "Eye" is pronounced like "i", while "dry" is "d-rye". Not the same ending."

Problem 8 ("Foot nose"), R1: "Wait, if the seventh letter is changed to next letter, maybe the original word is "footnot" (but that's not a word). Alternatively, maybe "foot" + "note", but "note" isn't a body part."
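
Assuming the intended transform in Problem 8 is "change the seventh letter of the clue to the next letter in the alphabet" (my reading of the trace), applying it mechanically is a one-liner, which is what makes the near-miss so striking:

    # Bump the letter at 0-based index i to the next letter, z wrapping to a.
    # Illustrative helper, not the paper's evaluation harness.
    def shift_letter(word: str, i: int) -> str:
        nxt = "a" if word[i] == "z" else chr(ord(word[i]) + 1)
        return word[:i] + nxt + word[i + 1:]

    print(shift_letter("footnose", 6))  # -> "footnote"

(And for Problem 3, "dry" and "eye" do rhyme; both end in the same /aɪ/ sound, so it is the trace's phonetics, not its recall, that fails.)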
mkoubaa, 3 months ago
The term "PhD Knowledge" makes me cringe. A PhD is about contribution to a field, not a standardized test score or a weird status-bearing substitute for IQ.
windsignaling, 3 months ago
I'm not a fan of these "gotchas" because they don't test for what we really care about.

Like counting the number of R's in strawberry, many of these are character-counting or character-manipulation problems which tokenization is not well suited for.

I'm sure an engineer could come up with a clever way to train for this, but that seems like optimizing for the wrong thing.

IMO these questions go in the wrong direction. Character permutation is a problem for "Software 1.0", not LLMs. Just as you wouldn't use an LLM to multiply 2 large numbers, you'd use a calculator.
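
To see the tokenization point concretely, here is a quick check with the tiktoken library (assuming its cl100k_base encoding; the exact split varies by tokenizer):

    # BPE tokens do not line up with characters, so "count the r's" is not
    # directly visible to the model. Requires: pip install tiktoken
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode("strawberry")
    print([enc.decode([t]) for t in tokens])  # e.g. ['str', 'aw', 'berry']

The three r's end up split across two tokens, so the model never directly sees the characters it is being asked to count.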
enum, 3 months ago
Results and dataset explorer here: https://huggingface.co/spaces/nuprl/verbal-reasoning-challenge
sega_sai, 3 months ago
Is it really certain that those problems and the answers were not in the training data for the tested LLMs? Presumably somebody on the internet wrote about them...
lokimedes, 3 months ago
As if the whole anti-intellectual hunt wasn't enough, now "PhD" is a category implying a holder of rote knowledge at the highest level. I guess it is hopeless to fight this, but a PhD is 100x more about the apprenticeship and real-world training as a scientist than any accumulated knowledge beyond one's prior training.

I know this is a rant, sorry, just so tired of the stupidity.
zone411, 3 months ago
I have a set of independent benchmarks and most also show a difference between reasoning and non-reasoning models:

LLM Confabulation (Hallucination): https://github.com/lechmazur/confabulations/

LLM Step Game: https://github.com/lechmazur/step_game

LLM Thematic Generalization Benchmark: https://github.com/lechmazur/generalization

LLM Creative Story-Writing Benchmark: https://github.com/lechmazur/writing

Extended NYT Connections LLM Benchmark: https://github.com/lechmazur/nyt-connections/

and a couple more that I haven't updated very recently.
akomtu, 3 months ago
The reasoning challenge is made of two parts:

1. Can you apply an existing model to a problem? For example: you're told how to multiply numbers and asked to multiply AHFG by VRBD in a base-26 system.

2. Can you come up with a model that explains the given examples? For example: you're given 10 triples like AxB=C and asked to explain what they have in common.

Simply imitating answers won't get you very far.
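
The first kind of task is mechanical once the encoding is fixed: treat A-Z as digits 0-25 and it becomes ordinary integer multiplication. A minimal sketch (my illustration; the digit convention A=0 is an assumption):

    # Base-26 arithmetic with letters as digits, A=0 .. Z=25.
    def from_base26(s: str) -> int:
        n = 0
        for ch in s:
            n = n * 26 + (ord(ch) - ord("A"))
        return n

    def to_base26(n: int) -> str:
        if n == 0:
            return "A"
        digits = []
        while n > 0:
            n, r = divmod(n, 26)
            digits.append(chr(r + ord("A")))
        return "".join(reversed(digits))

    print(to_base26(from_base26("AHFG") * from_base26("VRBD")))

The second kind of task, inducing the rule from ten AxB=C triples, is the genuinely open-ended half.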
zinccat, 3 months ago
My feeling is that a lot of the challenge could come from the tokenizer used by the model, similar to the "r's in strawberry" problem.
scotty79, 3 months ago
If you want a problem that is fairly easy for humans but hard for LLMs, it should have a solution that requires iteratively applying the same steps a few times, perhaps conditionally. I predict that LLMs, even with chain-of-thought, will drop the ball after just a few iterations.
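
A hypothetical task in that spirit, where the answer falls out of faithfully repeating one small conditional rule (my own illustration, not from the paper):

    # One iteration: rotate the string left by one, then uppercase the new
    # last character if it is a vowel, otherwise lowercase it.
    def step(s: str) -> str:
        s = s[1:] + s[0]
        last = s[-1]
        last = last.upper() if last.lower() in "aeiou" else last.lower()
        return s[:-1] + last

    word = "reasoning"
    for i in range(5):
        word = step(word)
        print(i + 1, word)

A human can trace five iterations by hand; the prediction above is that models lose track after only a few.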
brokensegue, 3 months ago
Are these really reasoning challenges? It seems like they're solved via brute force or guess-and-check.
bryan0, 3 months ago
Are LLMs not trained on NPR transcripts?
aghilmort, 3 months ago
really great work! are you a co-author?