30% drop in o1-preview accuracy when Putnam problems are slightly varied

560 points · by optimalsolver · 5 months ago

40 comments

jerf · 5 months ago
I remember when this stuff was all coming out and people were finally excited about ChatGPT getting the "which is heavier, a 10 pound bag of feathers or a 10 pound bag of bricks?" problem correct. But of course it got it correct. It was in the training set. Vary the problem slightly by just changing the nouns, or changing the numbers so that one in fact was heavier than the other, and performance went all over the map.

I just went to chatgpt.com and put into the chat box "Which is heavier, a 9.99-pound back of steel ingots or a 10.01 bag of fluffy cotton?", and the very first answer I got (that is, I didn't go fishing here) was:

    The 9.99-pound bag of steel ingots is heavier than the 10.01-pound bag of fluffy cotton by a small margin. Although the cotton may appear larger due to its fluffy nature, the steel ingots are denser and the weight of the steel bag is 9.99 pounds compared to the 10.01 pounds of cotton. So, the fluffy cotton weighs just a tiny bit more than the steel ingots.

Which, despite getting it both right and wrong, must still be graded as a "fail".

If you want to analyze these things for their true capability, you *need* to make sure you're out of the training set... and most of the things that leap to your mind in 5 seconds are leaping to your mind precisely because they are either something you've seen quite often or something that you can easily think of, and therefore many other people have easily thought of them as well. Get off the beaten path a bit and the math gets much less impressive.
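For anyone who wants to run this kind of probe systematically, here is a minimal sketch of a perturbation generator in the spirit of the comment above; the materials, weight ranges, and function name are illustrative, not taken from any published benchmark:

```python
import random

# Pairs of (dense material, fluffy material) to swap in for "bricks"/"feathers".
MATERIALS = [
    ("steel ingots", "fluffy cotton"),
    ("lead shot", "goose down"),
    ("iron nails", "packing foam"),
]

def make_variant(rng: random.Random) -> tuple[str, str]:
    """Return (prompt, expected_answer) with randomized nouns and weights."""
    dense, fluffy = rng.choice(MATERIALS)
    w_dense = round(rng.uniform(8.0, 12.0), 2)
    w_fluffy = round(rng.uniform(8.0, 12.0), 2)
    prompt = (f"Which is heavier, a {w_dense}-pound bag of {dense} "
              f"or a {w_fluffy}-pound bag of {fluffy}?")
    if w_dense > w_fluffy:
        expected = dense
    elif w_fluffy > w_dense:
        expected = fluffy
    else:
        expected = "equal"
    return prompt, expected

rng = random.Random(0)
for _ in range(3):
    prompt, expected = make_variant(rng)
    print(prompt, "->", expected)
```

Grading a model is then just checking whether its reply names the heavier bag (or "equal"), which keeps the answer key independent of whatever phrasing already sits in the training set.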
wim · 5 months ago
One experiment I would love to see, although not really feasible in practice, is to train a model on *all* digitized data from before the year 1905 (journals, letters, books, broadcasts, lectures, the works), and then ask it for a formula for mass-energy equivalence. A certain answer would definitely settle the debate on whether pattern recognition is a form of intelligence ;)
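For reference, the target formula of that thought experiment, which Einstein published in 1905:

```latex
E = mc^2
```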
rrr_oh_man · 5 months ago
Performance of these LLMs on real-life tasks feels very much like students last-minute cramming for Asian-style exams.

The ability to regurgitate perfectly, with no concept of the meaning.
dogcomplex · 5 months ago
I have a feeling the fact you're only slightly varying the input means the model is falling back into the question it was expecting and getting things wrong as a result. If you just varied it a little *more* and added some general-purpose prompt-fu like:

"First break the problem down into known facts, then pull relevant world knowledge, then bring it all together to assess the problem from multiple angles and make a conclusion. Do not immediately just use the first obvious conclusion."

You're gonna get a lot better responses. I suspect this is more of a "look! LLMs make bad kneejerk responses when we try to trick them from what they were expecting!" rather than "Look! They aren't even smart reasoners, they can't even figure out these problems without memorizing!"

They do memorize. But that cuts both ways: making problems very close to the memorized ones messes with their perception, the same way humans will instinctually respond to something that looks like a face before stepping back and assessing.
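A minimal sketch of the prompt scaffolding suggested above, assuming a plain string wrapper (the function name and exact wording are illustrative):

```python
REASONING_PREAMBLE = (
    "First break the problem down into known facts, then pull relevant world "
    "knowledge, then bring it all together to assess the problem from multiple "
    "angles and make a conclusion. Do not immediately use the first obvious conclusion."
)

def scaffolded_prompt(question: str) -> str:
    """Wrap a raw question in the structured-reasoning preamble before sending it to a model."""
    return f"{REASONING_PREAMBLE}\n\nQuestion: {question}"

print(scaffolded_prompt(
    "Which is heavier, a 9.99-pound bag of steel ingots or a 10.01-pound bag of fluffy cotton?"
))
```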
orange_puff · 5 months ago
This is very interesting, but a couple of things to note:

1. o1 still achieves >40% on the varied Putnam problems, which is still a feat most math students would not achieve.
2. o3 solved 25% of the Epoch AI dataset. There was an interesting post calling into question how difficult some of those problems actually are, but it still seems very impressive.

I think a fair conclusion here is that reasoning models are still really good at solving very difficult math and competitive programming problems, just better at ones they have seen before.
yifanl · 5 months ago
Is it just an open secret that the models are currently just being hardcoded for random benchmarks? Seems weird that people would be asking a chatbot Putnam problems :/
ankit219 · 5 months ago
They are highly effective pattern matchers. You change the pattern, and it won't work. I don't remember who, but most likely @tszzl (roon), commented on X that they still trained the traditional way, and there is no test-time compute (TTC) or Monte Carlo Tree Search (like AlphaGo) in o1 or o3. If that is true, then it's still predicting the next word based on its training data, likely following the most probable path (which comes directly from the training itself) even for the slight variations. Encouragingly, if TTC hasn't been explored, there is a long runway for performance improvements.

The other reason this seems hard to guess is that we don't know how much of what we are asking is in the training data. It would perform well on some tasks while failing at others, even though they are similar.
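For context on what test-time compute can mean in practice, here is a minimal sketch of one common form, self-consistency sampling: draw several candidate answers and keep the majority vote. The `sample_answer` stand-in below is hypothetical and just simulates a noisy guesser:

```python
import random
from collections import Counter

def sample_answer(question: str, rng: random.Random) -> str:
    # Hypothetical stand-in for one stochastic model completion;
    # here it simulates a roughly 80%-accurate guesser for illustration.
    return "42" if rng.random() < 0.8 else str(rng.randint(0, 9))

def self_consistency(question: str, k: int = 8, seed: int = 0) -> str:
    """Spend extra compute at inference time: sample k answers, return the most common one."""
    rng = random.Random(seed)
    votes = Counter(sample_answer(question, rng) for _ in range(k))
    return votes.most_common(1)[0][0]

print(self_consistency("What is 6 * 7?"))  # usually "42": majority voting smooths out noisy samples
```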
pama · 5 months ago
This workshop contribution is OK, and the benchmark is somewhat valuable even without the rephrasing part of the problems, but the rephrasing (of only a small number of problems) sometimes genuinely makes the problem more confusing to humans as well, by either poor phrasing (fig. 3) or unneeded breaking of convention (fig. 4; points in 2D are often P, with coordinates x, y). It would have been nice to see the effects of rephrasing the latest/post-training-date problems as a function of the increased noising, to delineate part of this confusion. I wonder how much better o3 is on the same benchmark.

Also, the correct title of this contribution is: Putnam-AXIOM: A Functional and Static Benchmark for Measuring Higher Level Mathematical Reasoning
e1g · 5 months ago
The paper includes several examples of their modified questions. There has been a substantial jump from o1-preview to o1, so I gave several samples to o1 and o1-pro (*not* o1-preview), and current o1s gave the correct answer to those modified problems. SOTA changes fast.
whimsicalism · 5 months ago
So many negative comments, as if o3 didn't get *25% on FrontierMath*, which is absolutely nuts.

Sure, LLMs will perform better if the answer to a problem is directly in their training set. But that doesn't mean they perform *badly* when the answer isn't in their training set.
bary12 · 5 months ago
I don't get why this matters at all. I looked at the o1-preview paper; Putnam is not mentioned. Meaning that: (a) OpenAI never claimed this model achieves X% on this dataset, and (b) OpenAI likely did not take measures to exclude this dataset from training. So the only conclusion we can draw from this result is: when prompted with questions that were verbatim in the dataset, performance increases dramatically. We already know this, and it doesn't say anything about the performance of the model on unseen problems.
frikskit · 5 months ago
An interesting example of this is:

There are 6 “a”s in the sentence: “How many ‘a’ in this sentence?”

https://chatgpt.com/share/677582a9-45fc-8003-8114-edd2e6efa289

Whereas the typical “strawberry” variant is now correct: there are 3 “r”s in the word “strawberry.”

Clearly the lesson wasn't learned; the model was just trained on people highlighting this failure case.
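A quick check of the actual count, assuming the sentence exactly as quoted above:

```python
sentence = "How many 'a' in this sentence?"
# Count literal lowercase 'a' characters: one in "many" and the quoted 'a' itself.
print(sentence.count("a"))  # -> 2, not 6
```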
Topfi · 5 months ago
I wouldn't be surprised if something similar is found concerning the ARC challenge, and it is why I still maintain my own private LLM challenges to gauge current capabilities. Of course, I have little illusion that these are fully private, but it is better than fully public tests.

Even the most straightforward, logical, easily reasoned ones stump all the LLMs I have access to, which is why I am so skeptical concerning emergence, reasoning, and all this hype around "AGI"…
curious_cat_163 · 5 months ago
The metaphor that might describe this paper is "iteration". I'd hazard to predict that we'll likely see more iterations of the following loop in 2025:

-> A new benchmark emerges with a novel evaluation method.
-> A new model saturates the benchmark by acquiring the novel "behavior."
-> A new benchmark introduces yet another layer of novelty.
-> Models initially fail until a lab discovers how to acquire the new behavior.

Case in point: OpenAI tackled this last step by introducing a paradigm called deliberative alignment to tackle some of the ARC benchmarks. [1]

Alongside all this technical iteration, there's a parallel cycle of product iteration, aiming to generate $ by selling intelligent software. The trillion-$ questions are around finding the right iterations on both technical and product dimensions.

[1] https://openai.com/index/deliberative-alignment/
atleastoptimal · 5 months ago
OK, but preview sucks; run it on o1 pro.

99% of studies claiming some out-of-distribution failure of an LLM use a model already made irrelevant by SOTA. These kinds of studies, with long lead times and review periods, are not the best format for making salient points given the speed at which the SOTA horizon progresses.
tomlockwood · 5 months ago
One I just did:

Q: I was heading to Aberdeen from London. On my way I passed seven wives, each wife had seven sacks, and in each sack there were seven cats and each cat had seven fish. How many were going to London?

A: This riddle is a play on words, and the answer is hidden in the phrasing! You mentioned you were heading to Aberdeen from London, but you didn't say anything about the seven wives, their sacks, cats, or fish actually being headed to London.

The only one going to London is you.

So the answer is: 1 person (you) are going to London.
lupire · 5 months ago
The researchers' answer to their variant of "Year: 2016, ID: A1" in the appendix is wrong.

The solution (the sum of 1, 2, 5, 6, 9, 10, 13, 14, ...) has an alternating pattern, so it has to be two piecewise interleaved polynomials, which cannot be expressed as a single polynomial.

Their answer works for k=1, 2, but not k=3.

https://openreview.net/pdf?id=YXnwlZe0yf

This does not give me confidence in the results of their paper.
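A quick numerical check of the parity argument in this comment; the piecewise formulas below are fitted to the listed sequence, not taken from the paper (whose exact problem and answer are not reproduced here):

```python
# Terms 1, 2, 5, 6, 9, 10, 13, 14, ...: the pairs (4k+1, 4k+2) for k = 0, 1, 2, ...
def term(i: int) -> int:
    k, r = divmod(i - 1, 2)
    return 4 * k + 1 + r

N = 12
partial = [sum(term(i) for i in range(1, n + 1)) for n in range(1, N + 1)]

# The partial sums split into two interleaved quadratics depending on the parity of n:
#   even n: S(n) = n^2 - n/2
#   odd  n: S(n) = n^2 - n/2 + 1/2
for n, s in enumerate(partial, start=1):
    predicted = n * n - n / 2 + (0.5 if n % 2 else 0.0)
    assert s == predicted, (n, s, predicted)

def single_quadratic(n: int) -> float:
    # A single quadratic fitted through S(1)=1, S(2)=3, S(3)=8.
    return 1.5 * n * n - 2.5 * n + 2

print(single_quadratic(4), partial[3])  # 16.0 vs 14 -> already fails at n = 4
```

So the partial sums do depend on parity, which supports the "two interleaved polynomials" observation.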
ben_w · 5 months ago
The link title says "slightly", but the PDF describes two different kinds of variations: variable names (slight) and problem constants (significant), and the 30% drop is on the combination of 26 variable-only and 26 variable + constant questions.

It's good to have a better test (though I bet this one will also be quickly saturated like all the others), but the title here doesn't seem justified by the page title there or the content.
huitzitziltzin · 5 months ago
This result is the same as a recent test of the same method+hypothesis from a group at Apple, no? I don’t have that reference handy but I don’t think I’m making it up.
chvid · 5 months ago
Isn't this simply because the dataset used (Putnam-AXIOM Original) is in the training data used to train the various models?

Given that these are simple variations (variable names and constant values changed in math problems), why wouldn't the companies creating these models (OpenAI etc.) create these variations themselves in order to ensure that the model is learning how to solve the problem rather than memorizing a solution? Seems like a very obvious thing to do ...
deegles · 5 months ago
I still find it hard to believe that LLM methods will lead to "true" AI. No amount of processing power or data will be sufficient without something new.
WiSaGaN · 5 months ago
I don't think this proves that the LLM is just a "pattern matcher". Humans make similar mistakes too, especially under time pressure (similar to a non-reasoning model that needs to "use system one" to generate an answer in one go). This is further evidenced by the fact that if you specifically ask the models to pay attention to traps, or just ask the follow-up question "are you sure?", they can usually get it right.
obblekk · 5 months ago
I hope someone reruns this on o1 and eventually o3.

If o1-preview was the start, like GPT-1, then we should expect generalization to increase quickly.
itfossil · 5 months ago
Oh, so it's almost like everything else AI-related: they basically cheated and lied.

If you are shocked by this, you are the sucker in the room.
mik09 · 4 months ago
Sometimes o1-preview starts hallucinating halfway through a good solution. It can get the intuition and the 'main' direction for a problem wrong too. But then, problem solving is just a series of rephrasings, and translating into different math domains is used by mathematicians for solving problems.
golol · 5 months ago
Hmmm without a human control it is not all that clear to me that the variation problems are not more difficult.
WiSaGaN · 5 months ago
There is also a curated benchmark just for those famous problems slightly varied: https://github.com/cpldcpu/MisguidedAttention/tree/main/eval
scotty79 · 5 months ago
I think the lamentations about real-world data running out are misplaced. We can multiply data with slight variations, which might lead to better resilience and more accurate model responses to novel problems.
m3kw9 · 5 months ago
It still needs to be prompted so that it's easy to understand. If you ask in a weird way, "how do I not not not win", instead of "how do I lose", you are gonna run into problems.
steveBK123 · 5 months ago
Yes, so when you change the sequence of tokens they've electronically memorized, they get a bit worse at predicting the next token?
matt3210 · 4 months ago
If the models are made to pass the benchmarks, of course they’d have some sort of overfit.
lazycog512 · 4 months ago
I know it's easier to just drop the internet tarball on the transformer, but at some point we need to just be giving these models grade-school math homework and times tables.
rubymamis · 5 months ago
I would love to see how well DeepSeek V3 does on this.
youworkwepay · 5 months ago
Or it's time to step back and call it what it is: very good pattern recognition.

I mean, that's cool... we can get a lot of work done with pattern recognition. Most of the human race never really moves above that level of thinking in the workforce or in navigating daily life, especially if they default to various societally prescribed patterns of getting stuff done (e.g. go to college or the military based on <these criteria>, find a job based on the best fit with <this list of desirable skills & experiences>, go to <these places> to find love...).
PunchTornado · 5 months ago
Isn't it weird that they didn't test Gemini?
retinaros · 5 months ago
Trained on the test set. Who even trusts OAI anymore?
sirolimus · 5 months ago
Yeah, no shit. LLMs are just REALLY good guessers. People gotta stop the hype lol.

Using LLMs for anything serious that requires consistency and trustworthiness without hallucinations is irresponsible and ridiculous.

Closed-source LLMs are a bubble and a joke.
paradox242 · 5 months ago
MY GOD, IT'S ALMOST LIKE THEY TRAINED THE MODEL FOR THE BENCHMARK INSTEAD OF FOR GENERAL APTITUDE! WHY WOULD OPENAI DO THIS?!?
lomkju · 5 months ago
Even humans get confused by trick questions, right? Once they understand it is a trick question, they no longer fall for it. :)
aucisson_masque · 5 months ago
It drops from 50 to 33.96. Still the best: o1 on the variation set is around 2 times better than Claude on the original test.

The rest of the LLMs are far away, in the single digits.

It makes me wonder if o1 is finally getting intelligent. LLMs are not supposed to understand these problems when you change variables and values; they have to rely on preexisting data from an absolutely identical solved problem to give a correct answer.

I haven't followed LLM development closely, but I heard once that ChatGPT is now composed of multiple LLMs, and maybe they put in multiple models with the specific purpose of problem solving or trigonometry, for instance. That would explain why it's so much better.
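For what it's worth, the two accuracy figures quoted above work out to roughly the relative drop in the title (assuming they refer to the same model's original and variation scores):

```python
original, variation = 50.0, 33.96          # accuracy figures quoted in the comment above
print((original - variation) / original)   # 0.3208 -> roughly the ~30% relative drop in the title
```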