First off, I want to say it's kind of baffling to me that this passes for novel "research", and that it's published by Apple of all companies in the field. I could be more forgiving of journalists trying to sell it as "look, LLMs are incapable of logical reasoning!", because journalists always shout loud stupid stuff, otherwise they don't get paid, apparently. But it's still hard to justify the nature of this "advancement".

I mean, what's being described seems like a super basic debugging step for any real-world system. This is the kind of stuff not-very-advanced QA teams in boring banks do to test your super-boring, not-very-advanced back-office bookkeeping systems. When that kind of testing reveals a bunch of bugs, you don't erase the bookkeeping system and conclude that banking should be done manually on paper only, since computers are obviously incapable of making correct decisions; you fix the problems one by one, which sometimes means not just fixing a software bug but revising the whole business logic of the process. But this is, you know, routine.

So, not being aware of what these benchmarks everyone uses to test LLM products actually contain (note that they are not testing LLMs as some kind of concept here, they are testing *products*), I would assume that OpenAI in particular, and any major company that released its own LLM product in the last couple of years in general, already does this super-obvious thing. So why is this huge discovery only happening now?

Well, obviously, there are two possibilities. Either none of them really do this, which sounds unbelievable: what do all those highly paid genius researchers even do, then? Or, more plausibly, they do, but don't publish it. That one sounds reasonable, given that there's no OpenAI, only AltmanAI, and all that stuff. They compete to build a better general reasoning system, so *of course* they don't want to reveal all their research.

But even that doesn't look reasonable to me (at least, at this very moment), given how basic the problem being discussed is. Every school kid knows you shouldn't test on the data you used for learning, so "peeking at the answers while writing the test" only to make your product perform slightly better on popular benchmarks seems super cheap. I can understand when Qualcomm tweaks processors specifically to beat AnTuTu, but trying to beat problem-solving by improving your crawler to grab every test on the internet is pointless. If anything, they should be actively trying not to contaminate their training step with popular benchmarks. So what's going on? Are the people working on these systems really that uncreative?

That said, all of this only applies to the general approach, which is to say it's about what the article *claims*, not what it *shows*. I personally am not convinced by the latter.

Let's take the kiwi example. The whole argument is framed as if it were obvious that the model shouldn't have subtracted those 5 kiwis. I don't know about that. Imagine this were a real test taken by real kids. I guarantee you most (all?) of them would be confused by the wording. Like, what are we supposed to do with this information? Why was it included? Then they'll decide whether they should subtract the 5 or not.
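To make the ambiguity concrete, here's roughly what the two readings look like. The numbers below are made up by me and only mirror the shape of the paper's kiwi problem, not its exact figures:

```python
# Purely illustrative numbers, in the same shape as the paper's kiwi problem
# (pick some on Friday, more on Saturday, double Friday's count on Sunday,
# and "five of them were a bit smaller than average") -- not the exact figures.
friday = 40
saturday = 60
sunday = 2 * friday   # "double the number he picked on Friday"
smaller = 5           # the detail whose relevance is up for debate

# Reading A: the "smaller" remark is irrelevant trivia, count everything.
reading_a = friday + saturday + sunday            # 200

# Reading B: "smaller than average" means those five don't count.
reading_b = friday + saturday + sunday - smaller  # 195

print(reading_a, reading_b)
```

The gap between the two answers is exactly that one judgment call about the "smaller" kiwis, nothing more exotic.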
I won't try to guess how many of them will, but the important thing is that they'll have to make this decision, and (hopefully) nobody will suddenly multiply the answer by 5 or do some equally meaningless shit.

And neither did the LLMs in question, apparently.

In the end, those students will get the wrong answer, sure. But who decides it's wrong? The teacher does, of course. Why is it wrong? Well, "because nobody said you should discard the small kiwis!" Great, man, you also didn't tell us we *shouldn't* discard them. This isn't a formal algebra problem; we're trying to use some common sense here.

In the end, it doesn't really matter what the teacher thinks the correct answer is, because it was just a stupid test. You may never agree with him on this one, and it won't affect your life. You'll probably end up making more than him anyway, so there's your consolation.

So framing situations like this as proof that the LLM got things objectively wrong just isn't right. It got them subjectively wrong, as judged by the Apple researchers in question and some other folks. Of course, this is what LLM development essentially is: doing whatever magic you deem necessary to get it to give more subjectively correct answers. Which brings us back to my first point: what is OpenAI's (Anthropic's, Meta's, etc.) subjectively correct answer here? What is the end goal anyway? Why does this "research" come from "Apple researchers" and not from one of those companies' tech blogs?