I don't think all this is needed to prove that LLMs aren't there yet.<p>Here is a simple trivial one:<p>"make ssh-keygen output decrypted version of a private key to another file"<p>I'm pretty sure everyone on the LLM hypetrain will agree that just that prompt should be enough for GPT-4o to give a correct command. After all, it's SSH.<p>However, here is the output command:<p><pre><code> ssh-keygen -p -f original_key -P "current_passphrase" -N "" -m PEM -q -C "decrypted key output" > decrypted_key
chmod 600 decrypted_key
</code></pre>
Even the basic fact that ssh-keygen is an in-place tool and does not write data to stdout is not captured strongly enough in the representation for it to be activated with this prompt. Thus, it also overwrites the existing key, and your decrypted_key file will contain "your identification has been saved with the new passphrase", lol.<p>Maybe we should set up a cron job - sorry, chatgpt task - to auto-tweet this in reply to all of the openai employees' hype tweets.<p>Edit:<p>chat link: <a href="https://chatgpt.com/share/67962739-f04c-800a-a56e-0c2fc8c2ddf8" rel="nofollow">https://chatgpt.com/share/67962739-f04c-800a-a56e-0c2fc8c2dd...</a><p>Edit 2: Tried it on deepseek<p>The prompt pasted as is, it gave the same wrong answer: <a href="https://imgur.com/jpVcFVP" rel="nofollow">https://imgur.com/jpVcFVP</a><p>However, with reasoning enabled, it caught the fact that the original file is overwritten in its chain of thought, and then gave the correct answer. Here is the relevant part of the chain of thought in a pastebin: <a href="https://pastebin.com/gG3c64zD" rel="nofollow">https://pastebin.com/gG3c64zD</a><p>And the correct answer:<p><pre><code> cp encrypted_key temp_key && \
ssh-keygen -p -f temp_key -m pem -N '' && \
mv temp_key decrypted_key
</code></pre>
I find it quite interesting that this seemingly 2020-era LLM problem is only solved correctly by the latest reasoning models, but it's cool that it works.
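<p>If you want to reproduce the failure mode yourself, or sanity-check the reasoning model's answer, a minimal sketch along these lines should work (the filenames are placeholders, and without -P ssh-keygen will prompt for the current passphrase interactively):<p><pre><code> # work on a throwaway copy so the original encrypted key is never touched
cp encrypted_key decrypted_key
# -p changes the passphrase in place; -N '' sets it to empty, i.e. removes the encryption
ssh-keygen -p -f decrypted_key -N '' -m PEM
# sanity check: -y derives the public key, which only works on a valid private key
ssh-keygen -y -f decrypted_key
</code></pre>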
Is it reasonable to imagine that LLMs <i>should</i> be able to play chess? I feel like we're expending a whole lot of effort trying to distort a screwdriver until it looks like a spanner, and then wondering why it won't grip bolts very well.<p>Why should a <i>language model</i> be good at chess or similar numerical/analytical tasks?<p>In what way does language resemble chess?
Continuing to ask every LLM the same puzzle seems flawed to me, since LLMs may eventually "learn" the answer to this particular puzzle simply because it has been seen somewhere, without actually becoming able to answer other chess-related puzzles in a way that at least follows the rules of chess.
My go-to is to ask the LLM to explain this poem (this is but one variation) ...<p><pre><code> Spring has sprung, the grass iz riz,
I wonder where da boidies iz?
Da boid iz on da wing! Ain’t that absoid?
I always hoid da wing...wuz on da boid!
</code></pre>
Every ChatGPT model before o1 failed; o1 did very well. DeepSeek-R1 7B did OK too.
If you input your own amateur-level chess games in PGN format and consult o1 about them, it provides surprisingly decent high-level game analysis. It identifies the opening played, determines the pawn structures, comments on possible middlegame plans, highlights mistakes, and can quite accurately assess whether an endgame is won or lost. Of course, as always, you should review o1's output and verify its statements, but it can offer surprising, chess-coach-like hints, insights, and improvement ideas that stockfish-only analysis cannot.
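<p>If you want to script this rather than paste games into the web UI, a minimal sketch against the chat completions API could look like the following; game.pgn, the prompt wording, and the "o1" model name are assumptions that depend on your own files and API access, and it relies on jq 1.6+ for --rawfile:<p><pre><code> # build the JSON request from a local PGN file and POST it to the API
PROMPT="Analyze this game like a chess coach: opening, pawn structures, middlegame plans, mistakes, endgame verdict."
jq -n --arg p "$PROMPT" --rawfile pgn game.pgn \
  '{model: "o1", messages: [{role: "user", content: ($p + "\n\n" + $pgn)}]}' \
| curl -s https://api.openai.com/v1/chat/completions \
    -H "Authorization: Bearer $OPENAI_API_KEY" \
    -H "Content-Type: application/json" \
    -d @-
</code></pre>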
This test will be worthless in a few weeks; it's now going into the training data. Even repeatedly posting it to LLM services (with analytics enabled) could lead to its inclusion in the training data.
Just from a personal layman's perspective: I find the ability to reason about chess moves a fair way to measure a specific type of reasoning. The fact that LLMs aren't good at this shows me that they're doing something else, which to me is as disappointing as it is _interesting_.
1..c1Q?? 2.Ba3+ loses on the spot, so it is easily deduced that 1..c1N+! must be the best try. Consulting an online tablebase [1] shows that the resulting position is a "blessed draw": White can still mate, but not in time to evade a 50-move draw claim.<p>[1] <a href="https://syzygy-tables.info/?fen=8/6B1/8/8/B7/8/K1pk4/8_b_-_-_0_1" rel="nofollow">https://syzygy-tables.info/?fen=8/6B1/8/8/B7/8/K1pk4/8_b_-_-...</a>
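<p>For anyone who prefers to script the lookup instead of using the web page, the public Lichess tablebase endpoint serves the same Syzygy data; this is a sketch, assuming that endpoint and its per-move response fields (san, category, dtz) are unchanged:<p><pre><code> # probe the position (Black to move) and list each move's tablebase verdict
curl -sG 'https://tablebase.lichess.ovh/standard' \
  --data-urlencode 'fen=8/6B1/8/8/B7/8/K1pk4/8 b - - 0 1' \
| jq '{category, moves: [.moves[] | {san, category, dtz}]}'
</code></pre>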
A while ago, dynomight dug in and tried to explain what's going on and why LLMs are bad at chess.<p>See: <a href="https://dynomight.net/more-chess/" rel="nofollow">https://dynomight.net/more-chess/</a><p>HN discussion of that article: <a href="https://news.ycombinator.com/item?id=42206817">https://news.ycombinator.com/item?id=42206817</a><p>(Edit: it turns out "it's complicated".)
Is there a standard way to read the puzzles if they don't tell you whose move it is? Also, which direction is each player moving in? Does the pawn move up the board, or is it still on its home square?