This doesn't feel like a "reasoning" challenge. The mental skill required to solve most of these seems to be the ability to loop over all known members of a category like "popular brand names" or "well-known actors" and see if they fit the clue.

As a human, you'd expect to fail either because you didn't know a category member (e.g. as a non-American I have no idea WTF "Citgo" is; I could never get the answer to the first question because I have never seen that name before in my life) or because you weren't able to bring it to mind; the mental act of looping over all members of a category is quite challenging for a human.

Admittedly, this is something an AI system could in principle be REALLY good at, and it's interesting to test and see that current ones are not! But it seems weird to me to call what's being tested "reasoning" when it's so heavily focused on memory recall (and evaluating whether a candidate answer works is trivial once you've brought it to mind and doesn't really require any intelligent thought).

(If the questions were multiple-choice, eliminating the challenge of bringing candidate answers to mind, which is the main difficulty for a human, then I'd agree it was a "reasoning" test.)
What is so interesting to me is that the reasoning traces for these often have the correct answer, but the model fails to realize it.

Problem 3 ("Dry Eye"), R1: "Wait, maybe "cubitus valgus" – no, too long. Wait, three letters each. Let me think again. Maybe "hay fever" is two words but not three letters each. Maybe "dry eye"? "Dry" and "eye" – both three letters. "Dry eye" is a condition. Do they rhyme? "Dry" (d-rye) and "eye" (i) – no, they don't rhyme. "Eye" is pronounced like "i", while "dry" is "d-rye". Not the same ending."

Problem 8 ("Foot nose"), R1: "Wait, if the seventh letter is changed to next letter, maybe the original word is "footnot" (but that's not a word). Alternatively, maybe "foot" + "note", but "note" isn't a body part."
The term "PhD Knowledge" makes me cringe. PhD is about contribution to a field, not a standardized test score or a weird status-bearing substitute for IQ.
I'm not a fan of these "gotchas" because they don't test for what we really care about.

Like counting the number of R's in strawberry, many of these are character-counting or character-manipulation problems, which tokenization is not well-suited for.

I'm sure an engineer could come up with a clever way to train for this, but that seems like optimizing for the wrong thing.

IMO these questions go in the wrong direction. Character permutation is a problem for "Software 1.0", not LLMs. Just as you wouldn't use an LLM to multiply 2 large numbers, you'd use a calculator.
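To make the "Software 1.0" point concrete, here's a throwaway Python sketch (my own illustration, not from the article) showing that the character-level tasks in question are one-liners once you step outside the LLM:

```python
# Trivial "Software 1.0" versions of the character-level gotchas.
word = "strawberry"
print(word.count("r"))    # 3 -- counting the R's
print(word[::-1])         # "yrrebwarts" -- character manipulation / permutation
print(1234567 * 7654321)  # and you'd multiply large numbers with a calculator, not an LLM
```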
Results and dataset explorer here: https://huggingface.co/spaces/nuprl/verbal-reasoning-challenge
Is it really certain that these problems and their answers were not in the training data of the tested LLMs? Presumably somebody on the internet has written about them...
As if the whole anti-intellectual hunt weren't enough, now "PhD" is a category implying a holder of rote knowledge at the highest level.
I guess it is hopeless to fight this, but a PhD is 100x more about the apprenticeship and real-world training as a scientist than about any accumulated knowledge beyond one's prior training.

I know this is a rant, sorry; I'm just so tired of the stupidity.
I have a set of independent benchmarks, and most also show a difference between reasoning and non-reasoning models:

LLM Confabulation (Hallucination): https://github.com/lechmazur/confabulations/

LLM Step Game: https://github.com/lechmazur/step_game

LLM Thematic Generalization Benchmark: https://github.com/lechmazur/generalization

LLM Creative Story-Writing Benchmark: https://github.com/lechmazur/writing

Extended NYT Connections LLM Benchmark: https://github.com/lechmazur/nyt-connections/

...and a couple more that I haven't updated very recently.
The reasoning challenge is made of two parts:

1. Can you apply an existing model to a problem? For example: you're told how to multiply numbers and asked to multiply AHFG by VRBD in a base-26 system (sketched below).

2. Can you come up with a model that explains the given examples? For example: you're given 10 triples like AxB=C and asked to explain what they have in common.

Simply imitating answers won't get you very far.
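For part 1, here's a toy Python sketch of what "applying an existing model" amounts to, under my own assumed convention of A=0 through Z=25 (the actual problem may define the digits differently): the model is just positional notation, and applying it is mechanical.

```python
# Hypothetical base-26 multiplication, assuming letter digits A=0 ... Z=25.
from string import ascii_uppercase

def from_b26(s: str) -> int:
    """Interpret a letter string as a base-26 number."""
    n = 0
    for ch in s:
        n = n * 26 + ascii_uppercase.index(ch)
    return n

def to_b26(n: int) -> str:
    """Convert a non-negative integer back to a letter string."""
    if n == 0:
        return "A"
    digits = []
    while n:
        n, r = divmod(n, 26)
        digits.append(ascii_uppercase[r])
    return "".join(reversed(digits))

print(to_b26(from_b26("AHFG") * from_b26("VRBD")))
```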
If you want a problem that is fairly easy for humans but hard for LLMs, it should have a solution that requires iteratively applying the same steps a few times, perhaps conditionally. I predict that LLMs, even with chain-of-thought, will drop the ball after just a few iterations.
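Something like this made-up rule (purely illustrative, not from any benchmark): a human can grind through half a dozen iterations reliably, and my prediction is that an LLM asked to trace it in its head will lose the thread.

```python
# Made-up "iterate a conditional rule" task.
def step(s: str) -> str:
    # If the string starts with a vowel, rotate it left by one character;
    # otherwise, reverse it and append the letter after its last character.
    if s[0] in "aeiou":
        return s[1:] + s[0]
    return s[::-1] + chr(ord(s[-1]) + 1)

s = "orange"
for i in range(6):
    s = step(s)
    print(i + 1, s)
```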