I love when people propose concrete claims like this: if they're wrong, they're disprovable. If they're right, you get unique and interesting insights from the attempts to disprove them.<p>I <i>suspect</i> these are all tokenization artifacts, but I'll probably take some time to try out the Conway's Game of Life problem by finetuning a model. A few issues I've noticed with the problems proposed in the article:<p>1. Wordle. This one, TBH, is a clear tokenization problem, not proof of the reasoning capabilities of LLMs or lack thereof. LLMs are trained on, and consume, multi-character tokens: they don't "see" individual characters. Wordle is primarily a game based around splitting words into discrete characters, and LLMs can't see the characters they're supposed to operate on if you give them whole words; depending on how you structure your answers, they also might not be able to see your answers! By breaking the words and answers into character-by-character sequences with spaces between the characters (forcing the tokenizer to break each character into a separate token the LLM can see; there's a quick tokenizer demo below), I successfully got GPT-4 to guess the word "BLAME" on my first attempt at playing Wordle with it: <a href="https://chat.openai.com/share/cc1569c4-44c3-4024-a0c2-eeb4988962ef" rel="nofollow">https://chat.openai.com/share/cc1569c4-44c3-4024-a0c2-eeb498...</a><p>2. Conway's Game of Life. Once again, the input sequences are given as a single long string with no spacing, which will probably result in the grid being tokenized into multi-character chunks and thus being partially invisible to the LLM. This one seems somewhat annoying to prompt, so I haven't tried it yet, but I suspect a combination of better prompting (a sketch of what I mean is below) and maybe finetuning would let the LLM learn to solve the problem.<p>Similarly, complaints about finetuned models failing to generalize to input sequences longer than the ones they were trained on are most likely also token-related. Each token an LLM sees (both during training and inference) is encoded alongside its absolute position in the input sequence; while you as a human see "1", "1 1", and "1 1 1" as repeated series of 1s, an LLM sees those characters as at least somewhat distinct. Given a synthetic dataset of a specific size, it can start to generalize over problems within the space that it sees, but if you give it new data outside of that context space, the new data won't necessarily look, to the LLM, like it's related to what it was trained on. There are architectural tricks to get around this (e.g. RoPE scaling; there's a toy illustration below), but in general I wouldn't make generalizations about what models can or can't "reason" about based on using context window sizes the model didn't see during training: that's more about token-related blind spots than about whether the model can be intelligent (at least, intelligent within the context window it's trained on).<p>One thing the author repeats several times throughout the article is that the mistakes LLMs make are far more instructive than their successes. However, I think in general this is not the case: if they <i>can</i> succeed sometimes, anyone who's spent much time finetuning knows you can typically train them to succeed more reliably. And the mistakes here don't seem particularly instructive anyway: they're tokenization artifacts, and rewriting the problem to work around specific types of blindness (at least in Wordle's case) seems to allow the LLMs to succeed.
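<p>To make the tokenization point concrete, here's a rough sketch using OpenAI's tiktoken library (my choice of tooling, not anything from the article) comparing how a packed Wordle guess tokenizes versus a space-separated one:<p><pre><code># Compare tokenization of a packed vs. space-separated Wordle guess.
# Assumes `pip install tiktoken`; cl100k_base is the GPT-4-era encoding.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

print([enc.decode([t]) for t in enc.encode("BLAME")])
# Likely a couple of multi-letter chunks (something like ['BL', 'AME']):
# the model never sees five separate letters.

print([enc.decode([t]) for t in enc.encode("B L A M E")])
# Roughly ['B', ' L', ' A', ' M', ' E']: one token per letter, so the model
# can actually "see" each character it's being asked to reason about.
</code></pre><p>The exact splits depend on the encoding, but the packed version basically never comes out as five single-letter tokens.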
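<p>And here's roughly what I'd try for the Game of Life problem: a reference step function (handy for generating finetuning data) plus a formatter that space-separates every cell so the tokenizer can't merge them. The 5x5 grid and the 0/1 encoding are my assumptions for the sketch, not the article's exact setup:<p><pre><code>def step(grid: list[list[int]]) -> list[list[int]]:
    """One Game of Life generation on a finite grid (everything outside is dead)."""
    rows, cols = len(grid), len(grid[0])
    nxt = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            neighbors = sum(
                grid[r + dr][c + dc]
                for dr in (-1, 0, 1)
                for dc in (-1, 0, 1)
                if (dr, dc) != (0, 0)
                and 0 <= r + dr < rows
                and 0 <= c + dc < cols
            )
            nxt[r][c] = 1 if neighbors == 3 or (grid[r][c] == 1 and neighbors == 2) else 0
    return nxt

def to_prompt(grid: list[list[int]]) -> str:
    """Space-separate every cell so each one becomes its own token in the prompt."""
    return "\n".join(" ".join(str(cell) for cell in row) for row in grid)

glider = [
    [0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
print(to_prompt(glider))
print()
print(to_prompt(step(glider)))
</code></pre>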
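<p>On the context-length point, here's a toy illustration of the position-index rescaling idea behind "RoPE scaling" (linear position interpolation); the numbers are made up and this isn't any particular model's implementation:<p><pre><code>import numpy as np

TRAIN_LEN = 8  # longest sequence length seen during training

def position_ids(seq_len: int, interpolate: bool = False) -> np.ndarray:
    """Positions fed to the positional encoding for a sequence of length seq_len."""
    ids = np.arange(seq_len, dtype=np.float64)
    if interpolate and seq_len > TRAIN_LEN:
        # Squash positions back into the range the model saw during training,
        # trading never-seen positions for finer-grained fractional ones.
        ids *= TRAIN_LEN / seq_len
    return ids

print(position_ids(12))                    # positions 8..11 were never seen in training
print(position_ids(12, interpolate=True))  # every position now lands inside [0, 8)
</code></pre><p>Without the rescaling, tokens past position 7 sit on positional encodings the model has literally never been trained on, which is a different failure mode than "can't reason."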
<p>FWIW, the author brings up Victor Taelin's famous A::B problem (a small reference solver is sketched at the end of this comment); I believe I was the first to solve it [1] (albeit via finetuning, so I was ineligible for the $10k prize; although I did it before the prize was announced, just for the pleasure of playing around with an interesting problem). While I think it's generally a useful insight to think of training as giving <i>more</i> intuition than intelligence, I do think the A::B problem eventually getting solved even by pure prompting shows that there's actually intelligence in there, too: it's not just intuition, or stochastic parroting of information from its training set. However, tokenization issues can easily get in the way of these kinds of problems if you're not aware of them (even the winning Claude 3 Opus prompt slightly rephrased the problem to get it to work with the tokenizer), so the models can actually appear dumber than they really are.<p>1. <a href="https://twitter.com/reissbaker/status/1776531331562033453" rel="nofollow">https://twitter.com/reissbaker/status/1776531331562033453</a>
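<p>For reference, here's a tiny ground-truth solver for the A::B rewrite system, handy as a checker or finetuning-data generator if you want to play with the problem yourself; I'm reproducing the rules from memory, so double-check them against Taelin's original post:<p><pre><code># The four tokens are A#, #A, B#, #B. Whenever two neighbors' #s face each
# other, the pair is rewritten: A# #A and B# #B annihilate, while A# #B and
# B# #A swap places (rules reproduced from memory).
RULES = {
    ("A#", "#A"): [],
    ("A#", "#B"): ["#B", "A#"],
    ("B#", "#A"): ["#A", "B#"],
    ("B#", "#B"): [],
}

def solve(tokens: list[str]) -> list[str]:
    """Apply the rewrite rules until no pair of #s face each other."""
    tokens = list(tokens)
    changed = True
    while changed:
        changed = False
        for i in range(len(tokens) - 1):
            pair = (tokens[i], tokens[i + 1])
            if pair in RULES:
                tokens[i:i + 2] = RULES[pair]
                changed = True
                break
    return tokens

print(solve(["B#", "A#", "#B", "#A", "B#"]))  # -> ['B#']
</code></pre>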