I don't think all this is needed to prove that LLMs aren't there yet.<p>Here is a trivial one:<p>"make ssh-keygen output decrypted version of a private key to another file"<p>I'm pretty sure everyone on the LLM hype train will agree that prompt alone should be enough for GPT-4o to produce a correct command. After all, it's SSH.<p>However, here is the output command:<p><pre><code> ssh-keygen -p -f original_key -P "current_passphrase" -N "" -m PEM -q -C "decrypted key output" > decrypted_key
chmod 600 decrypted_key
</code></pre>
Even the basic fact that ssh-keygen is an in-place tool and does not write key data to stdout is not captured strongly enough in the representation to be activated by this prompt. As a result, it also overwrites the existing key, and your decrypted_key file ends up containing "your identification has been saved with the new passphrase", lol.<p>Maybe we should set up a cron job - sorry, a chatgpt task - to auto-tweet this in reply to all of the openai employees' hype tweets.<p>Edit:<p>chat link: <a href="https://chatgpt.com/share/67962739-f04c-800a-a56e-0c2fc8c2ddf8" rel="nofollow">https://chatgpt.com/share/67962739-f04c-800a-a56e-0c2fc8c2dd...</a><p>Edit 2: Tried it on deepseek.<p>With the prompt pasted as-is, it gave the same wrong answer: <a href="https://imgur.com/jpVcFVP" rel="nofollow">https://imgur.com/jpVcFVP</a><p>However, with reasoning enabled, it caught in its chain of thought that the original file gets overwritten, and then gave the correct answer. Here is the relevant part of the chain of thought in a pastebin: <a href="https://pastebin.com/gG3c64zD" rel="nofollow">https://pastebin.com/gG3c64zD</a><p>And the correct answer:<p><pre><code> cp encrypted_key temp_key && \
ssh-keygen -p -f temp_key -m pem -N '' && \
mv temp_key decrypted_key
</code></pre>
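For what it's worth, the copy-first pattern is easy to verify end to end. A minimal sketch, assuming filenames of my own choosing and a throwaway passphrase ('secret'); it generates a fresh key on the spot so it's self-contained and touches nothing of yours:

```shell
#!/bin/sh
set -eu

# Create a passphrase-protected key to play with (ed25519 keys always
# use the new OpenSSH private key format).
ssh-keygen -q -t ed25519 -N 'secret' -f encrypted_key -C demo

# Work on a copy: ssh-keygen -p rewrites the key file in place, so
# running it on the original would destroy the encrypted version.
cp encrypted_key decrypted_key

# Strip the passphrase from the copy (-P old passphrase, -N new/empty).
ssh-keygen -q -p -f decrypted_key -P 'secret' -N ''

# The original still requires 'secret'; the copy now loads with no
# passphrase at all.
head -1 decrypted_key   # -> "-----BEGIN OPENSSH PRIVATE KEY-----"
```

Note that `cp` preserves the 600 permissions of the source key, so no extra `chmod` is needed here.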
I find it quite interesting that this seemingly 2020-era LLM problem is only solved correctly by the latest reasoning model, but it's cool that it works.