This doesn't feel like a "reasoning" challenge. The mental skill required to solve most of these seems to be the ability to loop over all known members of a category like "popular brand names" or "well-known actors" and see if they fit the clue.

As a human, you'd expect to fail either because you didn't know a category member (e.g. as a non-American I have no idea WTF "Citgo" is; I could never get the answer to the first question because I have never seen that name before in my life) or because you weren't able to bring it to mind; the mental act of looping over all members of a category is quite challenging for a human.

Admittedly, this is something an AI system could in principle be REALLY good at, and it's interesting to test and see that current ones are not! But it seems weird to me to call what's being tested "reasoning" when it's so heavily focused on memory recall (and evaluating whether a candidate answer works is trivial once you've brought it to mind and doesn't really require any intelligent thought).

(If the questions were multiple-choice, eliminating the challenge of bringing candidate answers to mind, which is the main difficulty for a human, then I'd agree it was a "reasoning" test.)
What is so interesting to me is that the reasoning traces for these often have the correct answer, but the model fails to realize it.

Problem 3 ("Dry Eye"), R1: "Wait, maybe "cubitus valgus" – no, too long. Wait, three letters each. Let me think again. Maybe "hay fever" is two words but not three letters each. Maybe "dry eye"? "Dry" and "eye" – both three letters. "Dry eye" is a condition. Do they rhyme? "Dry" (d-rye) and "eye" (i) – no, they don't rhyme. "Eye" is pronounced like "i", while "dry" is "d-rye". Not the same ending."

Problem 8 ("Foot nose"), R1: "Wait, if the seventh letter is changed to next letter, maybe the original word is "footnot" (but that's not a word). Alternatively, maybe "foot" + "note", but "note" isn't a body part."
The term "PhD Knowledge" makes me cringe. PhD is about contribution to a field, not a standardized test score or a weird status-bearing substitute for IQ.
I'm not a fan of these "gotchas" because they don't test for what we really care about.

Like counting the number of R's in strawberry, many of these are character-counting or character-manipulation problems, which tokenization is not well-suited for.

I'm sure an engineer could come up with a clever way to train for this, but that seems like optimizing for the wrong thing.

IMO these questions go in the wrong direction. Character permutation is a problem for "Software 1.0", not LLMs. Just as you wouldn't use an LLM to multiply 2 large numbers, you'd use a calculator.
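To make the "Software 1.0" point concrete, here's a throwaway Python sketch (my own illustration, not from the article) showing that the character-level tasks in question are one-liners once you step outside the LLM:

```python
# Trivial "Software 1.0" versions of the character-level gotchas.
word = "strawberry"
print(word.count("r"))    # 3 -- counting the R's
print(word[::-1])         # "yrrebwarts" -- character manipulation / permutation
print(1234567 * 7654321)  # and you'd multiply large numbers with a calculator, not an LLM
```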
Results and dataset explorer here: https://huggingface.co/spaces/nuprl/verbal-reasoning-challenge
Is it really certain that these problems and their answers were not in the training data of the tested LLMs? Presumably somebody on the internet has written about them...
As if the whole anti-intellectual hunt weren't enough, now "PhD" is a category implying a holder of rote knowledge at the highest level.
I guess it is hopeless to fight this, but a PhD is 100x more about the apprenticeship and real-world training as a scientist than about any accumulated knowledge beyond one's prior training.

I know this is a rant, sorry; I'm just so tired of the stupidity.
I have a set of independent benchmarks, and most also show a difference between reasoning and non-reasoning models:

LLM Confabulation (Hallucination): https://github.com/lechmazur/confabulations/

LLM Step Game: https://github.com/lechmazur/step_game

LLM Thematic Generalization Benchmark: https://github.com/lechmazur/generalization

LLM Creative Story-Writing Benchmark: https://github.com/lechmazur/writing

Extended NYT Connections LLM Benchmark: https://github.com/lechmazur/nyt-connections/

...and a couple more that I haven't updated very recently.
The reasoning challenge is made of two parts:

1. Can you apply an existing model to a problem? For example: you're told how to multiply numbers and asked to multiply AHFG by VRBD in a base-26 system (sketched below).

2. Can you come up with a model that explains the given examples? For example: you're given 10 triples like AxB=C and asked to explain what they have in common.

Simply imitating answers won't get you very far.
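For part 1, here's a toy Python sketch of what "applying an existing model" amounts to, under my own assumed convention of A=0 through Z=25 (the actual problem may define the digits differently): the model is just positional notation, and applying it is mechanical.

```python
# Hypothetical base-26 multiplication, assuming letter digits A=0 ... Z=25.
from string import ascii_uppercase

def from_b26(s: str) -> int:
    """Interpret a letter string as a base-26 number."""
    n = 0
    for ch in s:
        n = n * 26 + ascii_uppercase.index(ch)
    return n

def to_b26(n: int) -> str:
    """Convert a non-negative integer back to a letter string."""
    if n == 0:
        return "A"
    digits = []
    while n:
        n, r = divmod(n, 26)
        digits.append(ascii_uppercase[r])
    return "".join(reversed(digits))

print(to_b26(from_b26("AHFG") * from_b26("VRBD")))
```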
If you want a problem that is fairly easy for humans but hard for LLMs, it should have a solution that requires iteratively applying the same steps a few times, perhaps conditionally. I predict that LLMs, even with chain-of-thought, will drop the ball after just a few iterations.
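Something like this made-up rule (purely illustrative, not from any benchmark): a human can grind through half a dozen iterations reliably, and my prediction is that an LLM asked to trace it in its head will lose the thread.

```python
# Made-up "iterate a conditional rule" task.
def step(s: str) -> str:
    # If the string starts with a vowel, rotate it left by one character;
    # otherwise, reverse it and append the letter after its last character.
    if s[0] in "aeiou":
        return s[1:] + s[0]
    return s[::-1] + chr(ord(s[-1]) + 1)

s = "orange"
for i in range(6):
    s = step(s)
    print(i + 1, s)
```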